Abstract

Clustering is one of the most fundamental problems in data analysis and it has been studied extensively in the literature. 
Though many clustering algorithms have been proposed, clustering theories that justify the use of these clustering algorithms are 
still unsatisfactory. In particular, one of the fundamental challenges is to address the following question: 

What is a cluster in a set of data points? 

In this paper, we make an attempt to address such a question by considering a set of data points associated with a distance measure 
(metric). We first propose a new cohesion measure in terms of the distance measure. Using the cohesion measure, we define a 
cluster as a set of points that are cohesive to themselves. For such a definition, we show there are various equivalent statements 
that have intuitive explanations. We then consider the second question: 

How do we find clusters and good partitions of clusters under such a definition? 

For such a question, we propose a hierarchical agglomerative algorithm and a partitional algorithm. Unlike standard hierarchical 
agglomerative algorithms, our hierarchical agglomerative algorithm has a specific stopping criterion and it stops with a partition 
of clusters. Our partitional algorithm, called the K-sets algorithm in the paper, appears to be a new iterative algorithm. Unlike
the Lloyd iteration that needs two-step minimization, our K-sets algorithm only takes one-step minimization.

One of the most interesting findings of our paper is the duality result between a distance measure and a cohesion measure. 
Such a duality result leads to a dual K-sets algorithm for clustering a set of data points with a cohesion measure. The dual K-sets
algorithm converges in the same way as a sequential version of the classical kernel K-means algorithm. The key difference is
that a cohesion measure does not need to be positive semi-definite. 

Index Terms 

Clustering, hierarchical algorithms, partitional algorithms, convergence, K-sets, duality


I. Introduction 

Clustering is one of the most fundamental problems in data analysis and it has a lot of applications in various fields, 
including Internet search for information retrieval, social network analysis for community detection, and computational biology
for clustering protein sequences. The problem of clustering has been studied extensively in the literature (see e.g., the books
[?], [?] and the historical review papers [?], [?]). For a clustering problem, there is a set of data points (or objects) and a
similarity (or dissimilarity) measure that measures how similar two data points are. The aim of a clustering algorithm is to 
cluster these data points so that data points within the same cluster are similar to each other and data points in different clusters 
are dissimilar. 

As stated in [?], clustering algorithms can be divided into two groups: hierarchical and partitional. Hierarchical algorithms
can further be divided into two subgroups: agglomerative and divisive. Agglomerative hierarchical algorithms, starting from each 
data point as a sole cluster, recursively merge two similar clusters into a new cluster. On the other hand, divisive hierarchical 
algorithms, starting from the whole set as a single cluster, recursively divide a cluster into two dissimilar clusters. As such, there 
is a hierarchical structure of clusters from either a hierarchical agglomerative clustering algorithm or a hierarchical divisive 
clustering algorithm. 

Partitional algorithms do not have a hierarchical structure of clusters. Instead, they find all the clusters as a partition of the 
data points. The K-means algorithm is perhaps the simplest and the most widely used partitional algorithm for data points in
a Euclidean space, where the Euclidean distance serves as the natural dissimilarity measure. The K-means algorithm starts
from an initial partition of the data points into K clusters. It then repeatedly carries out the Lloyd iteration [?] that consists of
the following two steps: (i) generate a new partition by assigning each data point to the closest cluster center, and (ii) compute
the new cluster centers. The Lloyd iteration is known to reduce the sum of squared distances from each data point to its cluster
center in each iteration and thus the K-means algorithm converges to a local minimum. The new cluster centers can be easily
found if the data points are in a Euclidean space (or an inner product space). However, it is in general much more difficult to
find the representative points for clusters, called medoids, if data points are in a non-Euclidean space. The refined K-means
algorithms are commonly referred to as the K-medoids algorithm (see e.g., [?], [?], [?], [?]). As the K-means algorithm (or
the K-medoids algorithm) converges to a local optimum, it is quite sensitive to the initial choice of the partition. There are
some recent works that provide various methods for selecting the initial partition that might lead to performance guarantees
[?], [?], [?], [?], [?]. Instead of using the Lloyd iteration to minimize the sum of squared distances from each data point to
its cluster center, one can also formulate a clustering problem as an optimization problem with respect to a certain objective
function and then solve the optimization problem by other methods. This then leads to kernel and spectral clustering methods
(see e.g., [?], [?], [?], [?], [?] and [?], [?] for reviews of the papers in this area). Solving the optimization problems
formulated from the clustering problems is in general NP-hard and one has to resort to approximation algorithms [?]. In
[?], Balcan et al. introduced the concept of approximation stability that assumes all the partitions (clusterings) that have the
objective values close to the optimum ones are close to the target partition. Under such an assumption, they proposed efficient 
algorithms for clustering large data sets. 




Fig. 1. A consistent change of 5 clusters. 

Though there are already many clustering algorithms proposed in the literature, clustering theories that justify the use of these 
clustering algorithms are still unsatisfactory. As pointed out in [?], there are three commonly used approaches for developing
a clustering theory: (i) an axiomatic approach that outlines a list of axioms for a clustering function (see e.g., [?], [?], [?],
[?], [?], [?]), (ii) an objective-based approach that provides a specific objective for a clustering function to optimize (see
e.g., [?], [?]), and (iii) a definition-based approach that specifies the definition of clusters (see e.g., [?], [?], [?]). In [?],
Kleinberg adopted an axiomatic approach and showed an impossibility theorem for finding a clustering function that satisfies 
the following three axioms: 

(i) Scale invariance: if we scale the dissimilarity measure by a constant factor, then the clustering function still outputs 
the same partition of clusters. 

(ii) Richness: for any specific partition of the data points, there exists a dissimilarity measure such that the clustering 
function outputs that partition. 

(iii) Consistency: for a partition from the clustering function with respect to a specific dissimilarity measure, if we increase 
the dissimilarity measure between two points in different clusters and decrease the dissimilarity measure between two 
points in the same cluster, then the clustering function still outputs the same partition of clusters. Such a change of 
a dissimilarity measure is called a consistent change. 

The impossibility theorem is based on the fundamental result that the output of any clustering function satisfying the scale 
invariance property and the consistency property is in a collection of antichain partitions, i.e., there is no partition in that 
collection that in turn is a refinement of another partition in that collection. As such, the richness property cannot be satisfied. 
In [?], it was argued that the impossibility theorem is not an inherent feature of clustering. The key point in [?] is that
TABLE I

List of Notations

Notation                                                              Meaning
$\Omega = \{x_1, x_2, \ldots, x_n\}$                                  The set of all data points
$n$                                                                   The total number of data points
$d(x,y)$                                                              The distance between two points x and y
$\bar{d}(S_1,S_2)$                                                    The average distance between two sets $S_1$ and $S_2$ in (10)
$\mathrm{RD}(x\|y) = d(x,y) - \bar{d}(\{x\},\Omega)$                  The relative distance from x to y
$\mathrm{RD}(y) = \bar{d}(\Omega,\{y\}) - \bar{d}(\Omega,\Omega)$     The relative distance from a random point to y
$\mathrm{RD}(S_1\|S_2) = \bar{d}(S_1,S_2) - \bar{d}(S_1,\Omega)$      The relative distance from a set $S_1$ to another set $S_2$
$\gamma(x,y) = \mathrm{RD}(y) - \mathrm{RD}(x\|y)$                    The cohesion measure between two points x and y
$\gamma(S_1,S_2) = \sum_{x\in S_1}\sum_{y\in S_2}\gamma(x,y)$         The cohesion measure between two sets $S_1$ and $S_2$
$\Delta(x,S) = 2\bar{d}(\{x\},S) - \bar{d}(S,S)$                      The triangular distance from a point x to a set S
$Q = \sum_{k=1}^{K}\gamma(S_k,S_k)$                                   The modularity for a partition $S_1, S_2, \ldots, S_K$ of $\Omega$
$R = \sum_{k=1}^{K}\gamma(S_k,S_k)/|S_k|$                             The normalized modularity for a partition $S_1, S_2, \ldots, S_K$ of $\Omega$


the consistency property may not be a desirable property for a clustering function. This can be illustrated by considering a 
consistent change of 5 clusters in Figure 1. The figure is redrawn from Figure 1 in [29], which originally consists of 6 clusters.
On the left-hand side of Figure 1, it seems reasonable to have a partition of 5 clusters. However, after the consistent change,
a new partition of 3 clusters might be a better output than the original partition of 5 clusters. As such, they abandoned the
three axioms for clustering functions and proposed another three similar axioms for Clustering-Quality Measures (CQMs) (for
measuring the quality of a partition). They showed the existence of a CQM that satisfies their three axioms for CQMs.

As for the definition-based approach, most of the definitions of a single cluster in the literature are based on loosely defined 
terms [?]. One exception is [?], where Ester et al. provided a precise definition of a single cluster based on the concept of
density-based reachability. A point p is said to be directly density-reachable from another point q if point p lies within the
ε-neighborhood of point q and the ε-neighborhood of point q contains at least a minimum number of points. A point is said to
be density-reachable from another point if they are connected by a sequence of directly density-reachable points. Based on the
concept of density-reachability, a cluster is defined as a maximal set of points that are density-reachable from each other. An
intuitive way to see such a definition for a cluster in a set of data points is to convert the data set into a graph. Specifically,
if we put a directed edge from point p to point q whenever p is directly density-reachable from q, then a
cluster simply corresponds to a strongly connected component in the graph. One of the problems for such a definition is that
it requires specifying two parameters, ε and the minimum number of points in an ε-neighborhood. As pointed out in [30], it is
not an easy task to determine these two parameters. 

In this paper, we make an attempt to develop a clustering theory in metric spaces. In Section II, we first address the question:

What is a cluster in a set of data points in metric spaces? 

For this, we first propose a new cohesion measure in terms of the distance measure. Using the cohesion measure, we define a 
cluster as a set of points that are cohesive to themselves. For such a definition, we show in Theorem 7 that there are various
equivalent statements and these statements can be explained intuitively. We then consider the second question: 

How do we find clusters and good partitions of clusters under such a definition? 

For such a question, we propose a hierarchical agglomerative algorithm in Section III and a partitional algorithm in Section
IV. Unlike standard hierarchical agglomerative algorithms, our hierarchical agglomerative algorithm has a specific stopping
criterion. Moreover, we show in Theorem 9 that our hierarchical agglomerative algorithm returns a partition of clusters when
it stops. Our partitional algorithm, called the K-sets algorithm in the paper, appears to be a new iterative algorithm. Unlike the
Lloyd iteration that needs two-step minimization, our K-sets algorithm only takes one-step minimization. We further show in
Theorem 14 that the K-sets algorithm converges in a finite number of iterations. Moreover, for K = 2, the K-sets algorithm
returns two clusters when the algorithm converges. 

One of the most interesting findings of our paper is the duality result between a distance measure and a cohesion measure. 
In Section V, we first provide a general definition of a cohesion measure. We show that there is an induced distance measure,
called the dual distance measure, for each cohesion measure. On the other hand, there is also an induced cohesion measure,
called the dual cohesion measure, for each distance measure. In Theorem 18, we further show that the dual distance measure
of a dual cohesion measure of a distance measure is the distance measure itself. Such a duality result leads to a dual K-sets
algorithm for clustering a set of data points with a cohesion measure. The dual K-sets algorithm converges in the same way as
a sequential version of the classical kernel K-means algorithm. The key difference is that a cohesion measure does not need
to be positive semi-definite. 

In Table I, we provide a list of notations used in this paper.
II. Clusters in metric spaces


A. What is a cluster? 

As pointed out in [4], one of the fundamental challenges associated with clustering is to address the following question:

What is a cluster in a set of data points? 

In this paper, we will develop a clustering theory that formally defines a cluster for data points in a metric space. Specifically,
we consider a set of n data points, $\Omega = \{x_1, x_2, \ldots, x_n\}$, and a distance measure $d(x,y)$ for any two points x and y in $\Omega$.
The distance measure $d(\cdot,\cdot)$ is assumed to be a metric and it satisfies
(D1) $d(x,y) \ge 0$;

(D2) $d(x,x) = 0$;

(D3) (Symmetric) $d(x,y) = d(y,x)$;

(D4) (Triangular inequality) $d(x,y) \le d(x,z) + d(z,y)$.

Such a metric assumption is stronger than the usual dissimilarity (similarity) measures [?], where the triangular inequality
in general does not hold. We also note that (D2) is usually stated as a necessary and sufficient condition in the literature,
i.e., $d(x,y) = 0$ if and only if $x = y$. However, we only need the sufficient part in this paper. Our approach begins with a
definition-based approach. We first give a specific definition of what a cluster is (without the need of specifying any parameters) 
and show those axiom-like properties are indeed satisfied under our definition of a cluster. 


B. Relative distance and cohesion measure 

One important thing that we learn from the consistent change in Figure 1 is that a good partition of clusters should be
viewed at a global level and the relative distances among clusters should be considered as an important factor. The distance
measure between any two points only gives an absolute value and it does not tell us how close these two points are relative
to the whole set of data points. The key idea of defining the relative distance from one point x to another point y is to choose
another random point z as a reference point and compute the relative distance as the average of $d(x,y) - d(x,z)$ for all the
points z in $\Omega$. This leads to the following definition of relative distance.

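Definition 1 (Relative distance) The relative distance from a point x to another point y, denoted by $\mathrm{RD}(x\|y)$, is defined as
follows:

$$\mathrm{RD}(x\|y) = \frac{1}{n}\sum_{z\in\Omega}\bigl(d(x,y) - d(x,z)\bigr) = d(x,y) - \frac{1}{n}\sum_{z\in\Omega} d(x,z). \qquad (1)$$

The relative distance (from a random point) to a point y, denoted by $\mathrm{RD}(y)$, is defined as the average relative distance from
a random point to y, i.e.,

$$\mathrm{RD}(y) = \frac{1}{n}\sum_{z\in\Omega}\mathrm{RD}(z\|y) = \frac{1}{n}\sum_{z_2\in\Omega} d(z_2,y) - \frac{1}{n^2}\sum_{z_2\in\Omega}\sum_{z_1\in\Omega} d(z_2,z_1). \qquad (2)$$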

Note from (1) that in general $\mathrm{RD}(x\|y)$ is not symmetric, i.e., $\mathrm{RD}(x\|y) \ne \mathrm{RD}(y\|x)$. Also, $\mathrm{RD}(x\|y)$ may not be nonnegative.
In the following, we extend the notion of relative distance from one point to another point to the relative distance from one 
set to another set. 

Definition 2 (Relative distance) The relative distance from a set of points $S_1$ to another set of points $S_2$, denoted by
$\mathrm{RD}(S_1\|S_2)$, is defined as the average relative distance from a random point in $S_1$ to another random point in $S_2$, i.e.,

$$\mathrm{RD}(S_1\|S_2) = \frac{1}{|S_1|\cdot|S_2|}\sum_{x\in S_1}\sum_{y\in S_2}\mathrm{RD}(x\|y). \qquad (3)$$
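As a minimal sketch of Definitions 1 and 2, the following Python snippet computes the relative distance between index sets from a distance matrix. The helper names `avg_dist` and `relative_distance`, and the three points on the real line, are illustrative assumptions and do not come from the paper. The sketch uses the fact that averaging (1) over $x \in S_1$ and $y \in S_2$ gives the average distance from $S_1$ to $S_2$ minus the average distance from $S_1$ to all points.

```python
def avg_dist(dist, a, b):
    """Average distance over all ordered pairs (x, y) with x in a and y in b."""
    return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))


def relative_distance(dist, s1, s2):
    """RD(S1 || S2): average of RD(x || y) over x in S1 and y in S2."""
    omega = range(len(dist))  # indices of all data points
    return avg_dist(dist, s1, s2) - avg_dist(dist, s1, omega)


# Hypothetical data: three points on the real line with d(a, b) = |a - b|.
points = [0.0, 1.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]

rd_01 = relative_distance(dist, [0], [1])  # RD(x_1 || x_2)
rd_10 = relative_distance(dist, [1], [0])  # RD(x_2 || x_1)
print(rd_01, rd_10)
```

In this toy example the two values differ and both are negative, which illustrates the remark that the relative distance is in general neither symmetric nor nonnegative.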

Based on the notion of relative distance, we define a cohesion measure for two points x and y below. 

Definition 3 (Cohesion measure between two points) Define the cohesion measure between two points x and y, denoted by
$\gamma(x,y)$, as the difference of the relative distance to y and the relative distance from x to y, i.e.,

$$\gamma(x,y) = \mathrm{RD}(y) - \mathrm{RD}(x\|y). \qquad (4)$$

Two points x and y are said to be cohesive (resp. incohesive) if $\gamma(x,y) \ge 0$ (resp. $\gamma(x,y) \le 0$).

In view of (4), two points x and y are cohesive if the relative distance from x to y is not larger than the relative distance
(from a random point) to y. 

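Note from (1) and (2) that

$$\begin{aligned}
\gamma(x,y) &= \mathrm{RD}(y) - \mathrm{RD}(x\|y) \\
&= \frac{1}{n}\sum_{z_2\in\Omega} d(z_2,y) + \frac{1}{n}\sum_{z_1\in\Omega} d(x,z_1) - \frac{1}{n^2}\sum_{z_2\in\Omega}\sum_{z_1\in\Omega} d(z_2,z_1) - d(x,y) \qquad (5) \\
&= \frac{1}{n^2}\sum_{z_2\in\Omega}\sum_{z_1\in\Omega}\bigl(d(x,z_1) + d(z_2,y) - d(z_1,z_2) - d(x,y)\bigr). \qquad (6)
\end{aligned}$$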


Though there are many ways to define a cohesion measure for a set of data points in a metric space, our definition of the 
cohesion measure in Definition 3 has the following four desirable properties. The proof is based on the representations in (5)
and (6) and it is given in Appendix A.


Proposition 4 (i) (Symmetry) The cohesion measure is symmetric, i.e., $\gamma(x,y) = \gamma(y,x)$.

(ii) (Self-cohesiveness) Every data point is cohesive to itself, i.e., $\gamma(x,x) \ge 0$.

(iii) (Self-centredness) Every data point is more cohesive to itself than to another point, i.e., $\gamma(x,x) \ge \gamma(x,y)$ for all
$y \in \Omega$.

(iv) (Zero-sum) The sum of the cohesion measures between a data point and all the points in $\Omega$ is zero, i.e., $\sum_{y\in\Omega}\gamma(x,y) = 0$.


These four properties can be understood intuitively by viewing a cohesion measure between two points as a “binding 
force” between those two points. The symmetric property ensures that the binding force is reciprocated. The self-cohesiveness 
property ensures that each point is self-binding. The self-centredness property further ensures that the self binding force is 
always stronger than the binding force to the other points. In view of the zero-sum property, we know for every point x there 
are points that are incohesive to x and each of these points has a negative binding force to x. Also, there are points that are 
cohesive to x (including x itself from the self-cohesiveness property) and each of these points has a positive force to x. As 
such, the binding force will naturally “push” points into “clusters.” 
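The four properties can be checked numerically on a small example. The sketch below computes the cohesion matrix from representation (5); the function name `cohesion_matrix` and the four points on the real line are hypothetical choices made only for illustration.

```python
def cohesion_matrix(dist):
    """Return gamma[x][y] computed from representation (5) for a square distance matrix."""
    n = len(dist)
    row_mean = [sum(dist[x]) / n for x in range(n)]                       # (1/n) sum_z d(x, z)
    col_mean = [sum(dist[z][y] for z in range(n)) / n for y in range(n)]  # (1/n) sum_z d(z, y)
    total_mean = sum(row_mean) / n                                        # (1/n^2) sum_z2 sum_z1 d(z2, z1)
    return [[row_mean[x] + col_mean[y] - total_mean - dist[x][y] for y in range(n)]
            for x in range(n)]


# Hypothetical data: four points on the real line with d(a, b) = |a - b|.
points = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
gamma = cohesion_matrix(dist)

for row in gamma:
    print(["%.3f" % g for g in row])
```

In this example the three nearby points have positive cohesion with one another and negative cohesion with the far point, in line with the "binding force" picture; each row sums to zero up to rounding.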

To further understand the intuition of the cohesion measure, we can think of $z_1$ and $z_2$ in (6) as two random points that
are used as reference points. Then two points x and y are cohesive if $d(x,z_1) + d(z_2,y) \ge d(z_1,z_2) + d(x,y)$ for two
reference points $z_1$ and $z_2$ that are randomly chosen from $\Omega$. In Figure 2, we show an illustrating example for such an
intuition in $\mathbb{R}^2$. In Figure 2(a), point x is close to one reference point $z_1$ and point y is close to the other reference point
$z_2$. As such, $d(x,z_1) + d(z_2,y) \le d(z_1,z_2)$ and thus these two points x and y are incohesive. In Figure 2(b), point x is
not that close to $z_1$ and point y is not that close to $z_2$. However, x and y are on the two opposite sides of the segment
between the two reference points $z_1$ and $z_2$. Let w denote the point where the segment from x to y crosses the segment from
$z_1$ to $z_2$. As such, there are two triangles in this graph: the first triangle consists of the
three points x, $z_1$, and w, and the second triangle consists of the three points y, $z_2$, and w. From the triangular inequality,
we then have $d(w,z_1) + d(x,w) \ge d(x,z_1)$ and $d(y,w) + d(w,z_2) \ge d(y,z_2)$. Since $d(w,z_1) + d(w,z_2) = d(z_1,z_2)$ and
$d(x,w) + d(y,w) = d(x,y)$, it then follows that $d(z_1,z_2) + d(x,y) \ge d(x,z_1) + d(y,z_2)$. Thus, points x and y are also
incohesive in Figure 2(b). In Figure 2(c), point x is not that close to $z_1$ and point y is not that close to $z_2$ as in Figure 2(b).
Now x and y are on the same side of the segment between the two reference points $z_1$ and $z_2$, and w denotes the point where
the segment from x to $z_1$ crosses the segment from y to $z_2$. There are two triangles in
this graph: the first triangle consists of the three points x, y, and w, and the second triangle consists of the three points $z_1$, $z_2$,
and w. From the triangular inequality, we then have $d(x,w) + d(w,y) \ge d(x,y)$ and $d(w,z_1) + d(w,z_2) \ge d(z_1,z_2)$. Since
$d(x,w) + d(w,z_1) = d(x,z_1)$ and $d(w,y) + d(w,z_2) = d(z_2,y)$, it then follows that $d(x,z_1) + d(y,z_2) \ge d(z_1,z_2) + d(x,y)$.
Thus, points x and y are cohesive in Figure 2(c). In view of Figure 2(c), it is intuitive to see that two points x and y are
cohesive if they both are far away from the two reference points and they both are close to each other.

The notions of relative distance and cohesion measure are also related to the notion of relative centrality in our previous 
work [?]. To see this, suppose that we sample two points x and y from $\Omega$ according to the following bivariate distribution:

$$p(x,y) = \frac{e^{-\theta d(x,y)}}{\sum_{u\in\Omega}\sum_{v\in\Omega} e^{-\theta d(u,v)}}, \quad \theta > 0. \qquad (7)$$

Let $P_X(x) = \sum_{y\in\Omega} p(x,y)$ and $P_Y(y) = \sum_{x\in\Omega} p(x,y)$ be the two marginal distributions. Then one can verify that the
covariance $p(x,y) - P_X(x)P_Y(y)$ is proportional to the cohesion measure $\gamma(x,y)$ when $\theta \downarrow 0$. Intuitively, two points x and y
are cohesive if they are positively correlated according to the sampling in (7) when $\theta$ is very small.
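The verification can be sketched by a first-order expansion in $\theta$. The shorthand $\bar{d}_x$ and $\bar{D}$ below is introduced only for this sketch and is not the paper's notation; the last step uses representation (5) and the symmetry of d.

```latex
% Shorthand assumed for this sketch:
%   \bar{d}_x = \frac{1}{n}\sum_{z\in\Omega} d(x,z), \qquad
%   \bar{D}   = \frac{1}{n^2}\sum_{u\in\Omega}\sum_{v\in\Omega} d(u,v).
\begin{align*}
e^{-\theta d(x,y)} &= 1 - \theta\, d(x,y) + O(\theta^2), \\
\sum_{u\in\Omega}\sum_{v\in\Omega} e^{-\theta d(u,v)} &= n^2\bigl(1 - \theta \bar{D}\bigr) + O(\theta^2), \\
p(x,y) &= \frac{1}{n^2}\Bigl(1 - \theta\bigl(d(x,y) - \bar{D}\bigr)\Bigr) + O(\theta^2), \\
P_X(x) &= \frac{1}{n}\Bigl(1 - \theta\bigl(\bar{d}_x - \bar{D}\bigr)\Bigr) + O(\theta^2), \qquad
P_Y(y) = \frac{1}{n}\Bigl(1 - \theta\bigl(\bar{d}_y - \bar{D}\bigr)\Bigr) + O(\theta^2), \\
p(x,y) - P_X(x)P_Y(y) &= \frac{\theta}{n^2}\Bigl(\bar{d}_x + \bar{d}_y - \bar{D} - d(x,y)\Bigr) + O(\theta^2)
  = \frac{\theta}{n^2}\,\gamma(x,y) + O(\theta^2).
\end{align*}
```

Under this expansion the proportionality constant is $\theta / n^2$ to first order, so the sign of the covariance matches the sign of the cohesion measure for sufficiently small $\theta$ whenever $\gamma(x,y) \ne 0$.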
Fig. 2. Illustrating examples of the cohesion measure in $\mathbb{R}^2$: (a) incohesive as $d(x,z_1) + d(y,z_2) \le d(z_1,z_2)$, (b) incohesive as $d(w,z_1) + d(x,w) \ge
d(x,z_1)$ and $d(y,w) + d(w,z_2) \ge d(y,z_2)$, and (c) cohesive as $d(x,w) + d(w,y) \ge d(x,y)$ and $d(w,z_1) + d(w,z_2) \ge d(z_1,z_2)$.


Now we extend the cohesion measure between two points to the cohesion measure between two sets. 

Definition 5 (Cohesion measure between two sets) Define the cohesion measure between two sets $S_1$ and $S_2$, denoted by
$\gamma(S_1,S_2)$, as the sum of the cohesion measures of all the pairs of two points (with one point in $S_1$ and the other point in $S_2$),
i.e.,

$$\gamma(S_1,S_2) = \sum_{x\in S_1}\sum_{y\in S_2}\gamma(x,y). \qquad (8)$$

Two sets $S_1$ and $S_2$ are said to be cohesive (resp. incohesive) if $\gamma(S_1,S_2) \ge 0$ (resp. $\gamma(S_1,S_2) \le 0$).

C. Equivalent statements of clusters 
Now we define what a cluster is in terms of the cohesion measure. 

Definition 6 (Cluster) A nonempty set S is called a cluster if it is cohesive to itself, i.e.,

$$\gamma(S,S) \ge 0. \qquad (9)$$

In the following, we show the first main theorem of the paper. Its proof is given in Appendix B. 

Theorem 7 Consider a nonempty set S that is not equal to $\Omega$. Let $S^c = \Omega \setminus S$ be the set of points that are not in S. Also, let
$\bar{d}(S_1,S_2)$ be the average distance between two randomly selected points with one point in $S_1$ and another point in $S_2$, i.e.,

$$\bar{d}(S_1,S_2) = \frac{1}{|S_1|\cdot|S_2|}\sum_{x\in S_1}\sum_{y\in S_2} d(x,y). \qquad (10)$$

The following statements are equivalent.

(i) The set S is a cluster, i.e., $\gamma(S,S) \ge 0$.

(ii) The set $S^c$ is a cluster, i.e., $\gamma(S^c,S^c) \ge 0$.

(iii) The two sets S and $S^c$ are incohesive, i.e., $\gamma(S,S^c) \le 0$.

(iv) The set S is more cohesive to itself than to $S^c$, i.e., $\gamma(S,S) \ge \gamma(S,S^c)$.
ALGORITHM 1: The Hierarchical Agglomerative Algorithm
Input: A data set $\Omega = \{x_1, x_2, \ldots, x_n\}$ and a distance measure $d(\cdot,\cdot)$.

Output: A partition of clusters $\{S_1, S_2, \ldots, S_K\}$.
Initially, K = n; $S_i = \{x_i\}$, i = 1, 2, ..., n;

Compute the cohesion measures $\gamma(S_i,S_j) = \gamma(x_i,x_j)$ for all i, j = 1, 2, ..., n;
while there exists some i and j (with $i \ne j$) such that $\gamma(S_i,S_j) > 0$ do
    Merge $S_i$ and $S_j$ into a new set $S_k$, i.e., $S_k = S_i \cup S_j$;
    $\gamma(S_k,S_k) = \gamma(S_i,S_i) + 2\gamma(S_i,S_j) + \gamma(S_j,S_j)$;
    for each $\ell \ne k$ do
        $\gamma(S_k,S_\ell) = \gamma(S_\ell,S_k) = \gamma(S_i,S_\ell) + \gamma(S_j,S_\ell)$;
    end
    K = K - 1;
end

Reindex the K remaining sets to $\{S_1, S_2, \ldots, S_K\}$;


(v) $2\bar{d}(S,\Omega) - \bar{d}(\Omega,\Omega) - \bar{d}(S,S) \ge 0$.

(vi) The relative distance from $\Omega$ to S is not smaller than the relative distance from S to S, i.e., $\mathrm{RD}(\Omega\|S) \ge \mathrm{RD}(S\|S)$.

(vii) The relative distance from $S^c$ to S is not smaller than the relative distance from S to S, i.e., $\mathrm{RD}(S^c\|S) \ge \mathrm{RD}(S\|S)$.

(viii) $2\bar{d}(S,S^c) - \bar{d}(S,S) - \bar{d}(S^c,S^c) \ge 0$.

(ix) The relative distance from S to $S^c$ is not smaller than the relative distance from $\Omega$ to $S^c$, i.e., $\mathrm{RD}(S\|S^c) \ge \mathrm{RD}(\Omega\|S^c)$.

(x) The relative distance from $S^c$ to S is not smaller than the relative distance from $\Omega$ to S, i.e., $\mathrm{RD}(S^c\|S) \ge \mathrm{RD}(\Omega\|S)$.
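Some of these equivalent statements can be evaluated directly from a distance matrix. The sketch below checks statements (i), (ii), (iii), and (viii) for a given index set; the helper names and the example matrix (shortest-path distances of the complete bipartite graph with parts of sizes 2 and 3) are assumptions chosen for illustration, not taken from the paper.

```python
def avg_dist(dist, a, b):
    """Average distance (10) over all ordered pairs (x, y) with x in a and y in b."""
    return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))


def set_cohesion(dist, s1, s2):
    """gamma(S1, S2), obtained by summing representation (5) over x in S1 and y in S2."""
    omega = range(len(dist))
    return len(s1) * len(s2) * (avg_dist(dist, s1, omega) + avg_dist(dist, omega, s2)
                                - avg_dist(dist, omega, omega) - avg_dist(dist, s1, s2))


def cluster_statements(dist, s):
    """Evaluate statements (i), (ii), (iii), (viii) for the index set s."""
    comp = [i for i in range(len(dist)) if i not in s]  # the complement S^c
    return {
        "i": set_cohesion(dist, s, s) >= 0,
        "ii": set_cohesion(dist, comp, comp) >= 0,
        "iii": set_cohesion(dist, s, comp) <= 0,
        "viii": 2 * avg_dist(dist, s, comp) - avg_dist(dist, s, s)
                - avg_dist(dist, comp, comp) >= 0,
    }


# Hypothetical example: shortest-path distances on a bipartite graph,
# points 0 and 1 on one side, points 2, 3, 4 on the other side.
dist = [
    [0, 2, 1, 1, 1],
    [2, 0, 1, 1, 1],
    [1, 1, 0, 2, 2],
    [1, 1, 2, 0, 2],
    [1, 1, 2, 2, 0],
]
print(cluster_statements(dist, [0, 2]))
print(cluster_statements(dist, [0, 1]))
```

For this matrix the four checks agree with each other on both sets: they all hold for the set {0, 2} and all fail for the set {0, 1}. The comparisons use exact zero thresholds, so inputs that sit on the boundary may need a small tolerance.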

One surprising finding in Theorem 7(ii) is that the set $S^c$ is also a cluster. This shows that the points inside S are cohesive
and the points outside S are also cohesive. Thus, there seems to be a boundary between S and $S^c$ from the cohesion measure.
Another surprising finding is in Theorem 7(viii). One usually would expect that a cluster S should satisfy $\bar{d}(S,S) \le \bar{d}(S,S^c)$.
But it seems our definition of a cluster is much weaker than that. Regarding the scale invariance property, it is easy to see
from Theorem 7(viii) that the inequality there is still satisfied if we scale the distance measure by a constant factor. Thus, a
cluster of data points is still a cluster after scaling the distance measure by a constant factor. Regarding the richness property,
we argue that there exists a distance measure such that any subset of points in $\Omega$ is a cluster. To see this, we simply let the
distance between any two points in the subset be equal to 0 and the distance between a point outside the subset and a point
in the subset be equal to 1. Since a point x itself is a cluster, i.e., $\gamma(x,x) \ge 0$, we then have $\gamma(x,y) = \gamma(x,x) \ge 0$ for any
two points x and y in the subset. From (9), the subset is a cluster under such a choice of the distance measure. Furthermore,
one can also see from Theorem 7(vii) that for a cluster S, if we decrease the relative distance between two points in S and
increase the relative distance between one point in S and another point in $S^c$, then the set S is still a cluster under such a
“consistent” change.

We also note that in our proof of Theorem 7, we only need $d(\cdot,\cdot)$ to be symmetric. As such, the results in Theorem 7 also
hold even when the triangular inequality is not satisfied.

III. A HIERARCHICAL AGGLOMERATIVE ALGORITHM 

Once we define what a cluster is, our next question is 

How do we find clusters and good partitions of clusters? 

For this, we turn to an objective-based approach. We will show that clusters can be found by optimizing two specific objective 
functions by a hierarchical algorithm in Section III and a partitional algorithm in Section IV.

In the following, we first define a quality measure for a partition of $\Omega$.

Definition 8 (Modularity) Let $S_k$, k = 1, 2, ..., K, be a partition of $\Omega = \{x_1, x_2, \ldots, x_n\}$, i.e., $S_k \cap S_{k'}$ is an empty set for
$k \ne k'$ and $\cup_{k=1}^{K} S_k = \Omega$. The modularity index Q with respect to the partition $S_k$, k = 1, 2, ..., K, is defined as follows:

$$Q = \sum_{k=1}^{K}\gamma(S_k,S_k). \qquad (11)$$
Based on such a quality measure, we can thus formulate the clustering problem as an optimization problem for finding a
partition $S_1, S_2, \ldots, S_K$ (for some unknown K) that maximizes the modularity index Q. Note that

$$Q = \sum_{k=1}^{K}\gamma(S_k,S_k) = \sum_{k=1}^{K}\sum_{x\in S_k}\sum_{y\in S_k}\gamma(x,y) = \sum_{x\in\Omega}\sum_{y\in\Omega}\gamma(x,y)\,\delta_{c(x),c(y)}, \qquad (12)$$

where $c(x)$ is the cluster of x and $\delta_{c(x),c(y)} = 1$ if x and y are in the same cluster (and 0 otherwise). In view of (12), another way to look at
the optimization problem is to find the assignment of each point to a cluster. However, it was shown in [?] that finding the
optimal assignment for modularity maximization is NP-complete in the strong sense and thus heuristic algorithms, such as 
hierarchical algorithms and partitional algorithms are commonly used in the literature for solving the modularity maximization 
problem. 
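The two forms of the modularity index, the per-cluster sum in (11) and the assignment form in (12), can be compared on a toy example. The function names and the data below are hypothetical and serve only as a sketch.

```python
def cohesion_matrix(dist):
    """gamma[x][y] from representation (5) for a square distance matrix."""
    n = len(dist)
    row_mean = [sum(dist[x]) / n for x in range(n)]
    col_mean = [sum(dist[z][y] for z in range(n)) / n for y in range(n)]
    total_mean = sum(row_mean) / n
    return [[row_mean[x] + col_mean[y] - total_mean - dist[x][y] for y in range(n)]
            for x in range(n)]


def modularity_by_sets(gamma, partition):
    """Q as in (11): sum of gamma(S_k, S_k) over the sets of the partition."""
    return sum(gamma[x][y] for s in partition for x in s for y in s)


def modularity_by_labels(gamma, labels):
    """Q as in (12): sum of gamma(x, y) over pairs with the same cluster label."""
    n = len(gamma)
    return sum(gamma[x][y] for x in range(n) for y in range(n) if labels[x] == labels[y])


# Hypothetical data: four points on the real line with d(a, b) = |a - b|.
points = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
gamma = cohesion_matrix(dist)

q_sets = modularity_by_sets(gamma, [[0, 1, 2], [3]])
q_labels = modularity_by_labels(gamma, [0, 0, 0, 1])
q_single = modularity_by_labels(gamma, [0, 0, 0, 0])
print(q_sets, q_labels, q_single)
```

Both forms give the same value for the same partition. Placing all points in one cluster gives a modularity of zero (up to rounding), which is a consequence of the zero-sum property in Proposition 4(iv).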

In Algorithm 1, we propose a hierarchical agglomerative clustering algorithm that converges to a local optimum of this
objective. The algorithm starts from n clusters with each point itself as a cluster. It then recursively merges two disjoint 
cohesive clusters to form a new cluster until either there is a single cluster left or all the remaining clusters are incohesive. 
There are two main differences between a standard hierarchical agglomerative clustering algorithm and ours: 

(i) Stopping criterion: in a standard hierarchical agglomerative clustering algorithm, such as single linkage or complete 
linkage, there is no stopping criterion. Here our algorithm stops when all the remaining clusters are incohesive. 

(ii) Greedy selection: our algorithm only needs to select a pair of cohesive clusters to merge. It does not need to be the 
most cohesive pair. This could potentially speed up the algorithm in a large data set. 

In the following theorem, we show that the modularity index Q in (11) is non-decreasing in every iteration of the hierarchical
agglomerative clustering algorithm and it indeed produces clusters. Its proof is given in Appendix C. 

Theorem 9 (i) Every set returned by the hierarchical agglomerative clustering algorithm is indeed a cluster. 

(ii) For the hierarchical agglomerative clustering algorithm, the modularity index is non-decreasing in every iteration 
and thus converges to a local optimum. 
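The hierarchical agglomerative algorithm described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the dictionary-based bookkeeping, the tolerance `eps` (used in place of an exact comparison with zero to guard against rounding), and the six points on the real line are all assumptions. The sketch also records the modularity index after each merge, using the fact that the update rule for $\gamma(S_k,S_k)$ changes Q by $2\gamma(S_i,S_j)$.

```python
def cohesion_matrix(dist):
    """gamma[x][y] from representation (5) for a square distance matrix."""
    n = len(dist)
    row_mean = [sum(dist[x]) / n for x in range(n)]
    col_mean = [sum(dist[z][y] for z in range(n)) / n for y in range(n)]
    total_mean = sum(row_mean) / n
    return [[row_mean[x] + col_mean[y] - total_mean - dist[x][y] for y in range(n)]
            for x in range(n)]


def agglomerate(dist, greedy=True, eps=1e-12):
    """Merge cohesive pairs of sets until all remaining pairs are incohesive."""
    n = len(dist)
    gamma = cohesion_matrix(dist)
    g = {(i, j): gamma[i][j] for i in range(n) for j in range(n)}  # set-level cohesion
    clusters = {i: [i] for i in range(n)}                          # start from singletons
    q_history = [sum(gamma[i][i] for i in range(n))]               # modularity Q per step
    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = [(g[i, j], i, j) for a, i in enumerate(ids) for j in ids[a + 1:]
                 if g[i, j] > eps]
        if not pairs:
            break  # every remaining pair is incohesive
        # Greedy: most cohesive pair; otherwise: any cohesive pair (here the first found).
        gij, i, j = max(pairs) if greedy else pairs[0]
        for l in ids:
            if l != i and l != j:
                g[i, l] = g[l, i] = g[i, l] + g[j, l]
        g[i, i] = g[i, i] + 2 * gij + g[j, j]
        clusters[i].extend(clusters.pop(j))  # the merged set keeps index i
        q_history.append(q_history[-1] + 2 * gij)
    return list(clusters.values()), q_history


# Hypothetical data: two well-separated groups on the real line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]
clusters, q_history = agglomerate(dist)
print(clusters)
print(q_history)
```

On this toy data both the greedy and the non-greedy selection stop with the two groups {0, 1, 2} and {3, 4, 5}, and the recorded modularity values never decrease, since each merge adds twice a positive cohesion measure.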

As commented before, our algorithm only requires to find a pair of cohesive clusters to merge in each iteration. This is 
different from the greedy selection in [?], Chapter 13, and [?]. Certainly, our hierarchical agglomerative clustering algorithm
can also be operated in a greedy manner. As in [?], in each iteration we can merge the two clusters that result in the
largest increase of the modularity index, i.e., the most cohesive pair. It is well-known (see e.g., the book [2]) that a naive
implementation of a greedy hierarchical agglomerative clustering algorithm has $O(n^3)$ computational complexity and the
computational complexity can be further reduced to $O(n^2 \log n)$ if priority queues are implemented for the greedy selection.
We also note that there are several hierarchical agglomerative clustering algorithms proposed in the literature for community 
detection in networks (see e.g., [?], [?], [?], [?]). These algorithms are also based on “modularity” maximization. Among
them, the fast unfolding algorithm in [?] is a fast one, as there is a second phase of building a new (and much smaller)
network whose nodes are the communities found during the previous phase. The Newman and Girvan modularity in [?] is
based on a probability measure from a random selection of an edge in a network (see [?] for more detailed discussions) and
this is different from the distance metric used in this paper. 

In the following, we provide an illustrating example for our hierarchical agglomerative clustering algorithm by using the 
greedy selection of the most cohesive pair. 


Example 10 (Zachary’s karate club) As in [?], [?], we consider Zachary’s karate club friendship network [?] in
Figure 3. The set of data was observed by Wayne Zachary [?] over the course of two years in the early 1970s at an American
university. During the course of the study, the club split into two groups because of a dispute between its administrator (node
34 in Figure 3) and its instructor (node 1 in Figure 3).

In Figure 4, we show the dendrogram generated by using our hierarchical agglomerative clustering algorithm with the greedy
selection of the most cohesive pair in each iteration. The distance measure is the geodesic distance of the graph in Figure
3. The algorithm stops when there are three incohesive clusters left, one led by the administrator (node 34), one led by the
instructor (node 1), and person number 9 himself. According to [?], there was an interesting story for person number 9. He
was a weak supporter for the administrator. However, he was only three weeks away from a test for black belt (master status) 
when the split of the club occurred. He would have had to give up his rank if he had joined the administrator’s club. He ended 
up with the instructor’s club. We also run an additional step for our algorithm (to merge the pair with the largest cohesive 
measure) even though the remaining three clusters are incohesive. The additional step reveals that person number 9 is clustered 
into the instructor’s club. 
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Fig. 3. The network of friendships between individuals in the karate club study of Zacharv l4H . The instructor and the administrator are represented by 
nodes 1 and 34, respectively. Squares represent individuals who ended up with the administrator and circles represent those who ended up with the instmctor. 


Dendrogram of the Zachary’s karate club 



Fig. 4. The dendrogram from our greedy hierarchical agglomerative clustering algorithm for the Zachary karate club friendship network. 


IV. A PARTITIONAL ALGORITHM 


A. Triangular distance 

In this section, we consider another objective function. 


Definition 11 (normalized modularity) Let $S_k$, $k = 1, 2, \ldots, K$, be a partition of $\Omega = \{x_1, x_2, \ldots, x_n\}$, i.e., $S_k \cap S_{k'}$ is an
empty set for $k \neq k'$ and $\cup_{k=1}^{K} S_k = \Omega$. The normalized modularity index $R$ with respect to the partition $S_k$, $k = 1, 2, \ldots, K$,
is defined as follows:

$$R = \sum_{k=1}^{K} \frac{1}{|S_k|}\, \gamma(S_k, S_k). \qquad (13)$$

Unlike the hierarchical agglomerative clustering algorithm in the previous section, in this section we assume that $K$ is fixed
and known in advance. As such, we may use an approach similar to the classical K-means algorithm by iteratively assigning
each point to the nearest set (until it converges). Such an approach requires a measure of how close a point $x$
is to a set $S$. In the K-means algorithm, such a measure is defined as the square of the distance between $x$ and the centroid
of $S$. However, there is no centroid for a set in a non-Euclidean space and we need to come up with another measure.

Our idea for measuring the distance from a point $x$ to a set $S$ is to randomly choose two points $z_1$ and $z_2$ from $S$ and consider
the three sides of the triangle formed by $x$, $z_1$ and $z_2$. Note that the triangular inequality guarantees that $d(x, z_1) + d(x, z_2) - d(z_1, z_2) \ge 0$.
Moreover, if $x$ is close to $z_1$ and $z_2$, then $d(x, z_1) + d(x, z_2) - d(z_1, z_2)$ should also be small. We illustrate such an intuition
in Figure 5, where there are two points $x$ and $y$ and a set $S$ in $\mathbb{R}^2$. Such an intuition leads to the following definition of
triangular distance from a point $x$ to a set $S$.

Fig. 5. An illustration of the triangular distance in $\mathbb{R}^2$.

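Definition 12 (Triangular distance) The triangular distance from a point $x$ to a set $S$, denoted by $\Delta(x, S)$, is defined as
follows:

$$\Delta(x, S) = \frac{1}{|S|^2} \sum_{z_1 \in S} \sum_{z_2 \in S} \big( d(x, z_1) + d(x, z_2) - d(z_1, z_2) \big). \qquad (14)$$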
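As a quick numerical illustration of this definition, the following sketch evaluates the double sum directly from a distance matrix and compares it with the equivalent average-distance expression $2\bar{d}(\{x\}, S) - \bar{d}(S, S)$. The function names, the random test points and the use of NumPy are assumptions made for illustration only; they are not part of the paper's implementation.

```python
import numpy as np

def triangular_distance(D, x, S):
    """Triangular distance from point index x to the index set S,
    evaluated term by term from the double sum in the definition."""
    S = np.asarray(S)
    d_x = D[x, S]                                   # d(x, z) for every z in S
    # entry (a, b) is d(x, z_a) + d(x, z_b) - d(z_a, z_b)
    terms = d_x[:, None] + d_x[None, :] - D[np.ix_(S, S)]
    return terms.mean()                             # average over |S|^2 ordered pairs

def average_distance(D, S1, S2):
    """Average distance between two index sets."""
    return D[np.ix_(S1, S2)].mean()

# Hypothetical data: six random points in the plane with the Euclidean distance.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

S = [1, 2, 4]
delta_def = triangular_distance(D, 0, S)
delta_avg = 2 * average_distance(D, [0], S) - average_distance(D, S, S)
```

Both quantities should agree up to floating-point round-off, and the value is nonnegative whenever the distance matrix satisfies the triangular inequality.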

In the following lemma, we show several properties of the triangular distance and its proof is given in Appendix D. 


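Lemma 13 (i)

$$\Delta(x, S) = 2\bar{d}(\{x\}, S) - \bar{d}(S, S) \ge 0. \qquad (15)$$

(ii)

$$\Delta(x, S) = \gamma(x, x) - \frac{2}{|S|}\gamma(\{x\}, S) + \frac{1}{|S|^2}\gamma(S, S). \qquad (16)$$

(iii) Let $S_k$, $k = 1, 2, \ldots, K$, be a partition of $\Omega = \{x_1, x_2, \ldots, x_n\}$. Then

$$\sum_{k=1}^{K} \sum_{x \in S_k} \Delta(x, S_k) = \sum_{x \in \Omega} \gamma(x, x) - R. \qquad (17)$$

(iv) Let $S_k$, $k = 1, 2, \ldots, K$, be a partition of $\Omega = \{x_1, x_2, \ldots, x_n\}$ and $c(x)$ be the index of the set to which $x$ belongs,
i.e., $x \in S_{c(x)}$. Then

$$\sum_{k=1}^{K} \sum_{x \in S_k} \Delta(x, S_k) = \sum_{k=1}^{K} \sum_{x \in S_k} \bar{d}(\{x\}, S_k). \qquad (18)$$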


The first property of this lemma represents the triangular distance by the average distance. The second property
represents the triangular distance by the cohesion measure. Such a property plays an important role in the duality result in
Section V. The third property shows that the optimization problem for maximizing the normalized modularity $R$ is equivalent
to the optimization problem that minimizes the sum of the triangular distance of each point to its set. The fourth property
further shows that such an optimization problem is also equivalent to the optimization problem that minimizes the sum of
the average distance of each point to its set. Note that $\bar{d}(\{x\}, S_k) = \frac{1}{|S_k|} \sum_{y \in S_k} d(x, y)$. The objective of maximizing the
normalized modularity $R$ is thus also equivalent to minimizing

$$\sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{x \in S_k} \sum_{y \in S_k} d(x, y).$$

This is different from the K-median objective, the K-means objective and the min-sum objective addressed in [21].

ALGORITHM 2: The K-sets Algorithm

Input: A data set $\Omega = \{x_1, x_2, \ldots, x_n\}$, a distance measure $d(\cdot, \cdot)$, and the number of sets $K$.
Output: A partition of sets $\{S_1, S_2, \ldots, S_K\}$.

(0) Initially, choose arbitrarily $K$ disjoint nonempty sets $S_1, \ldots, S_K$ as a partition of $\Omega$.

(1) for $i = 1, 2, \ldots, n$ do

        Compute the triangular distance $\Delta(x_i, S_k)$ for each set $S_k$ by using (15).

        Find the set to which the point $x_i$ is closest in terms of the triangular distance.

        Assign point $x_i$ to that set.

    end

(2) Repeat from (1) until there is no further change.




Fig. 6. Two rings: (a) a clustering result by the K-means algorithm, and (b) a clustering result by the K-sets algorithm.


B. The K-sets algorithm 

In the following, we propose a partitional clustering algorithm, called the K-sets algorithm in Algorithm 2, based on the
triangular distance. The algorithm is very simple. It starts from an arbitrary partition of the data points that contains $K$ disjoint
sets. Then for each data point, we assign the data point to the closest set in terms of the triangular distance. We repeat the
process until there is no further change. Unlike the Lloyd iteration that needs two-step minimization, the K-sets algorithm
only takes one-step minimization. This might give the K-sets algorithm a computational advantage over the K-medoids
algorithms Jh), HI, Q, H). 
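The sequential sweep described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: the function name `k_sets`, the `max_sweeps` cap and the small tolerance used to break ties in favor of the current set are all choices made here. The triangular distance is evaluated through the average-distance form $2\bar{d}(\{x\}, S) - \bar{d}(S, S)$.

```python
import numpy as np

def k_sets(D, labels, max_sweeps=100):
    """Sequential K-sets sketch.

    D      : n x n distance matrix (assumed to satisfy the triangular inequality)
    labels : initial assignment with values 0..K-1, every set nonempty
    """
    labels = np.array(labels, dtype=int)
    n = labels.size
    K = labels.max() + 1
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            delta = np.empty(K)
            for k in range(K):
                S = np.flatnonzero(labels == k)
                # triangular distance: 2 * avg d(x_i, S) - avg d(S, S)
                delta[k] = 2.0 * D[i, S].mean() - D[np.ix_(S, S)].mean()
            best = int(np.argmin(delta))
            # move only on a strict improvement; a point that is alone in its set
            # has triangular distance 0 to that set, so no set ever becomes empty
            if delta[best] < delta[labels[i]] - 1e-12:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical example: two well-separated groups on a line, interleaved start.
pts = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
D = np.abs(pts[:, None] - pts[None, :])
labels = k_sets(D, [0, 1, 0, 1, 0, 1])
```

Recomputing every average from scratch keeps the sketch short; a practical implementation would cache the per-set sums and update them when a point moves.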

In the following theorem, we show the convergence of the K-sets algorithm. Moreover, for $K = 2$, the K-sets algorithm
yields two clusters. Its proof is given in Appendix E.

Theorem 14 (i) In the K-sets algorithm based on the triangular distance, the normalized modularity is increasing when
there is a change, i.e., a point is moved from one set to another. Thus, the algorithm converges to a local optimum
of the normalized modularity.

(ii) Let $S_1, S_2, \ldots, S_K$ be the $K$ sets when the algorithm converges. Then for all $i \neq j$, the two sets $S_i$ and $S_j$ are two
clusters if these two sets are viewed in isolation (by removing the data points not in $S_i \cup S_j$ from $\Omega$).

An immediate consequence of Theorem 14 (ii) is that for $K = 2$, the two sets $S_1$ and $S_2$ are clusters when the algorithm
converges. However, we are not able to show that for $K \ge 3$ the $K$ sets, $S_1, S_2, \ldots, S_K$, are clusters in $\Omega$. On the other hand,
we are not able to find a counterexample either. All the numerical examples that we have tested for $K \ge 3$ yield $K$ clusters.

C. Experiments 

In this section, we report several experimental results for the K-sets algorithm, including the dataset with two rings in
Section IV-C1, the stochastic block model in Section IV-C2, and the mixed National Institute of Standards and Technology
dataset in Section IV-C3.

1) Two rings: In this section, we first provide an illustrating example for the K-sets algorithm.

In Figure 6, we generate two rings by randomly placing 500 points in $\mathbb{R}^2$. The outer (resp. inner) ring consists of 300 (resp.
200) points. The radius of a point in the outer (resp. inner) ring is uniformly distributed between 20 and 22 (resp. 10 and 12).
The angle of each point is uniformly distributed between 0 and $2\pi$. In Figure 6(a), we show a typical clustering result by using
the classical K-means algorithm with $K = 2$. As the centroids of these two rings are very close to each other, it is well-known
that the K-means algorithm does not perform well for the two rings. Instead of using the Euclidean distance as the distance
measure for our K-sets algorithm, we first convert the two rings into a graph by adding an edge between two points with the
Euclidean distance less than 5. Then the distance measure between two points is defined as the geodesic distance of these two
points in the graph. By doing so, we can then easily separate these rings by using the K-sets algorithm with $K = 2$ as shown
in Figure 6(b).

The purpose of this example is to show the limitation of the applicability of the K-means algorithm. The data points for
the K-means algorithm need to be in some Euclidean space. On the other hand, the data points for the K-sets algorithm
only need to be in some metric space. As such, the distance matrix constructed from a graph cannot be used directly by
the K-means algorithm while it is still applicable for the K-sets algorithm.

2) Stochastic block model: The stochastic block model (SBM), as a generalization of the Erdős-Rényi random graph [43],
is a commonly used method for generating random graphs that can be used for benchmarking community detection algorithms
mi, ED. In a stochastic block model with $q$ blocks (communities), the nodes in the random graph are evenly
distributed to these $q$ blocks. The probability that there is an edge between two nodes within the same block is $p_{in}$ and the
probability that there is an edge between two nodes in two different blocks is $p_{out}$. These edges are generated independently.
Let $c_{in} = n \cdot p_{in}$ and $c_{out} = n \cdot p_{out}$. Then it is known [44] that these $q$ communities can be detected (in theory for a large
network) if

$$|c_{in} - c_{out}| > q\sqrt{\text{mean degree}}. \qquad (19)$$

In this paper, we use MODE-NET [45] to run SBM. Specifically, we consider a stochastic block model with two blocks.
The number of nodes in the stochastic block model is 1,000 with 500 nodes in each of these two blocks. The average degree
of a node is set to be 3. The values of $c_{in} - c_{out}$ of these graphs are in the range from 2.5 to 5.9 with a common step of 0.1.
We generate 20 graphs for each $c_{in} - c_{out}$. Isolated vertices are removed. Thus, the exact numbers of vertices used in this
experiment are slightly less than 1,000.

We compare our K-sets algorithm with some other community detection algorithms, such as OSLOM2 [46], infomap [47],
[48], and fast unfolding [38]. The metric used for the K-sets algorithm for each sample of the random graph is the resistance
distance, and this is pre-computed by NumPy [49]. The resistance distance matrix (denoted by $R = (R_{i,j})$) can be derived
from the pseudo-inverse of the graph Laplacian matrix (denoted by $\Gamma = (\Gamma_{i,j})$) as follows [50]:

$$R_{i,j} = \begin{cases} 0, & \text{if } i = j, \\ \Gamma_{i,i} + \Gamma_{j,j} - \Gamma_{i,j} - \Gamma_{j,i}, & \text{otherwise.} \end{cases}$$
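A minimal NumPy sketch of this pre-computation is given below. It assumes that $\Gamma$ is taken to be the Moore-Penrose pseudo-inverse of the graph Laplacian $L = D - A$ (degree matrix minus adjacency matrix), which is the standard construction of the resistance distance; the function name and the toy path graph are illustrative choices and do not come from the paper's code.

```python
import numpy as np

def resistance_distance(A):
    """Resistance distance matrix of a connected undirected graph
    given by its (0/1 or weighted) adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian (assumed choice of matrix)
    G = np.linalg.pinv(L)                    # Moore-Penrose pseudo-inverse, playing the role of Gamma
    diag = np.diag(G)
    R = diag[:, None] + diag[None, :] - G - G.T   # Gamma_ii + Gamma_jj - Gamma_ij - Gamma_ji
    np.fill_diagonal(R, 0.0)                 # R_ii = 0 by definition
    return R

# Path graph 0 - 1 - 2: two unit resistors in series.
A_path = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
R_path = resistance_distance(A_path)
```

For the path graph the end-to-end entry equals the series resistance of two unit edges, which gives a simple sanity check before running the sketch on larger random graphs.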

The K-sets algorithm and OSLOM2 are implemented in C++, and the others are all taken from igraph [51] and are implemented
in C with Python wrappers. In Table II, we show the average running times for these four algorithms over 700 trials. The
pre-computation time for the K-sets algorithm is the time to compute the distance matrix. Except for infomap, the other three
algorithms are very fast. In Figure 7, we compute the normalized mutual information measure (NMI) by using a built-in
function in igraph [51] for the results obtained from these four algorithms. Each point is averaged over 20 random graphs from
the stochastic block model. The error bars are the 95% confidence intervals. In this stochastic block model, the theoretical
phase transition threshold from (19) is $c_{in} - c_{out} = 3.46$. It seems that the K-sets algorithm is able to detect these two blocks
when $c_{in} - c_{out} > 4.5$. Its performance in that range is better than infomap [47], [48], fast unfolding [38] and OSLOM2 [46].
We note that the comparison is not exactly fair as the other three algorithms do not have the information of the number of
blocks (communities).
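For reference, the threshold value quoted for this experiment follows from substituting the experimental settings ($q = 2$ blocks and mean degree 3) into the detectability condition $|c_{in} - c_{out}| > q\sqrt{\text{mean degree}}$. The short worked computation below is only an arithmetic check of that substitution:

```latex
\[
  q\sqrt{\text{mean degree}} \;=\; 2\sqrt{3} \;\approx\; 3.46 .
\]
```

Under this reading, the tested range of $c_{in} - c_{out}$ from 2.5 to 5.9 covers both sides of the threshold.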


TABLE II

Average running time (in seconds).

                   infomap   fast unfolding   OSLOM2   K-sets
pre-computation    0         0                0        2.3096
running            0.7634    0.0074           0.0059   0.0060
total              0.7634    0.0074           0.0059   2.3156


3) Mixed National Institute of Standards and Technology dataset: In this section, we consider a real-world dataset, the
mixed National Institute of Standards and Technology dataset (the MNIST dataset) [52]. The MNIST dataset contains 60,000
samples of hand-written digits. These samples are 28x28 pixel grayscale images (i.e., each image is a 784-dimensional
data point). For our experiments, we select the first 1,000 samples of each digit from 0 to 9 to create a total
of 10,000 samples.

To fairly evaluate the performance of the K-sets algorithm, we compare the K-sets algorithm with two clustering algorithms
in which the number of clusters is also known a priori, i.e., the K-means++ algorithm and the K-medoids algorithm
[53]. For the MNIST dataset, the number of clusters is 10 (for the ten digits, 0, 1, 2, ..., 9). The K-means++ algorithm is
an improvement of the standard K-means algorithm with a specific method to choose the initial centroids of the $K$ clusters.
Like the K-sets algorithm, the K-medoids algorithm is also a clustering algorithm that uses a distance measure. The key
difference between the K-sets algorithm and the K-medoids algorithm is that we use the triangular distance to a set for
the assignment of each data point and the K-medoids algorithm uses the distance to a medoid for such an assignment. The
Euclidean distances between data points (samples from the MNIST dataset) for the K-medoids algorithm and the K-sets
algorithm are pre-computed by NumPy [49]. The K-sets algorithm is implemented in C++, and the others are implemented
in C with Python wrappers. All the programs are executed on an Acer Altos-T350-E2 machine with two Intel(R) Xeon(R)
CPU E5-2690 v2 processors. In order to have a fair comparison of their running times, the parallelization of each program is
disabled, i.e., only one core is used in these experiments. We assume that the input data is already stored in the main memory
and the time consumed for I/O is not recorded.

Fig. 7. Comparison of infomap [47], [48], fast unfolding [38], OSLOM2 [46] and K-sets for the stochastic block model with two blocks. Each point is
averaged over 20 such graphs. The error bars are the 95% confidence intervals. The theoretical phase transition threshold in this case is 3.46.

[Figure 8: NMI on the vertical axis (ticks from 0.4 to 0.54) for K-means++, K-medoids and K-sets.]

Fig. 8. Comparison of K-means++ (111, K-medoids ns and K-sets for the MNIST dataset. Each point is averaged over 100 trials. The error bars are the
95% confidence intervals.

In Table III, we show the average running times for these three algorithms over 100 trials. Both the K-medoids algorithm
and the K-sets algorithm need to compute the distance matrix and this is shown in the row marked with the pre-computation
time. The total running times for these three algorithms are roughly the same for this experiment. In Figure 8, we compute
the normalized mutual information measure (NMI) by using a built-in function in igraph [51] for the results obtained from
these three algorithms. Each point is averaged over 100 trials. The error bars are the 95% confidence intervals. In view of
Figure 8, the K-sets algorithm outperforms the K-means++ algorithm and the K-medoids algorithm for the MNIST dataset.
One possible explanation for this is that both the K-means++ algorithm and the K-medoids algorithm only select a single
representative data point for a cluster and that representative data point may not be able to represent the whole cluster well
enough. On the other hand, the K-sets algorithm uses the triangular distance that takes the distance to every point in a cluster
into account.


TABLE III

Average running time (in seconds).

                   K-means++   K-medoids   K-sets
pre-computation    0           36.981      36.981
running            49.940      1.228       1.801
total              49.940      38.209      38.782

V. Duality between a cohesion measure and a distance measure

A. The duality theorem 

In this section, we show the duality result between a cohesion measure and a distance measure. In the following, we first 
provide a general definition for a cohesion measure. 

Definition 15 A measure between two points $x$ and $y$, denoted by $\beta(x, y)$, is called a cohesion measure for a set of data
points $\Omega$ if it satisfies the following three properties:

(C1) (Symmetry) $\beta(x, y) = \beta(y, x)$ for all $x, y \in \Omega$.

(C2) (Zero-sum) For all $x \in \Omega$, $\sum_{y \in \Omega} \beta(x, y) = 0$.

(C3) (Triangular inequality) For all $x, y, z$ in $\Omega$,

$$\beta(x, x) + \beta(y, z) - \beta(x, z) - \beta(x, y) \ge 0. \qquad (20)$$

In the following lemma, we show that the specific cohesion measure defined in Section II indeed satisfies (C1)-(C3) in
Definition 15. Its proof is given in Appendix F.

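Lemma 16 Suppose that $d(\cdot, \cdot)$ is a distance measure for $\Omega$, i.e., $d(\cdot, \cdot)$ satisfies (D1)-(D4). Let

$$\beta(x, y) = \frac{1}{n} \sum_{z_2 \in \Omega} d(z_2, y) + \frac{1}{n} \sum_{z_1 \in \Omega} d(x, z_1) - \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} d(z_2, z_1) - d(x, y). \qquad (21)$$

Then $\beta(x, y)$ is a cohesion measure for $\Omega$.

We know from (5) that the cohesion measure $\gamma(\cdot, \cdot)$ defined in Section II has the following representation:

$$\gamma(x, y) = \frac{1}{n} \sum_{z_2 \in \Omega} d(z_2, y) + \frac{1}{n} \sum_{z_1 \in \Omega} d(x, z_1) - \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} d(z_2, z_1) - d(x, y).$$

As a result of Lemma 16, it also satisfies (C1)-(C3) in Definition 15. As such, we call the cohesion measure $\gamma(\cdot, \cdot)$ defined in
Section II the dual cohesion measure of the distance measure $d(\cdot, \cdot)$.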

On the other hand, if $\beta(x, y)$ is a cohesion measure for $\Omega$, then there is an induced distance measure and it can be viewed
as the dual distance measure of the cohesion measure $\beta(x, y)$. This is shown in the following lemma and its proof is given in
Appendix G.

Lemma 17 Suppose that $\beta(\cdot, \cdot)$ is a cohesion measure for $\Omega$. Let

$$d(x, y) = (\beta(x, x) + \beta(y, y))/2 - \beta(x, y). \qquad (22)$$

Then $d(\cdot, \cdot)$ is a distance measure that satisfies (D1)-(D4).

In the following theorem, we show the duality result. Its proof is given in Appendix H. 

Theorem 18 Consider a set of data points $\Omega$. For a distance measure $d(\cdot, \cdot)$ that satisfies (D1)-(D4), let

$$d^*(x, y) = \frac{1}{n} \sum_{z_2 \in \Omega} d(z_2, y) + \frac{1}{n} \sum_{z_1 \in \Omega} d(x, z_1) - \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} d(z_2, z_1) - d(x, y) \qquad (23)$$

be the dual cohesion measure of $d(\cdot, \cdot)$. On the other hand, for a cohesion measure $\beta(\cdot, \cdot)$ that satisfies (C1)-(C3), let

$$\beta^*(x, y) = (\beta(x, x) + \beta(y, y))/2 - \beta(x, y) \qquad (24)$$

be the dual distance measure of $\beta(\cdot, \cdot)$. Then $d^{**}(x, y) = d(x, y)$ and $\beta^{**}(x, y) = \beta(x, y)$ for all $x, y \in \Omega$.

ALGORITHM 3: The dual K-sets Algorithm

Input: A data set $\Omega = \{x_1, x_2, \ldots, x_n\}$, a cohesion measure $\gamma(\cdot, \cdot)$, and the number of sets $K$.
Output: A partition of sets $\{S_1, S_2, \ldots, S_K\}$.

(0) Initially, choose arbitrarily $K$ disjoint nonempty sets $S_1, \ldots, S_K$ as a partition of $\Omega$.

(1) for $i = 1, 2, \ldots, n$ do

        Compute the triangular distance $\Delta(x_i, S_k)$ for each set $S_k$ by using (26).

        Find the set to which the point $x_i$ is closest in terms of the triangular distance.

        Assign point $x_i$ to that set.

    end

(2) Repeat from (1) until there is no further change.
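The duality stated in Theorem 18 lends itself to a quick numerical check. The sketch below is an illustration only: the function names and the random Euclidean test data are assumptions. It maps a distance matrix to its dual cohesion matrix (the average over a column, plus the average over a row, minus the overall average, minus the distance itself), maps the result back through $(\beta(x,x) + \beta(y,y))/2 - \beta(x,y)$, and compares with the original matrix.

```python
import numpy as np

def dual_cohesion(D):
    """d -> d*: dual cohesion matrix of a symmetric distance matrix D."""
    row = D.mean(axis=1, keepdims=True)      # (1/n) * sum over z1 of d(x, z1)
    col = D.mean(axis=0, keepdims=True)      # (1/n) * sum over z2 of d(z2, y)
    return col + row - D.mean() - D

def dual_distance(B):
    """beta -> beta*: dual distance matrix of a cohesion matrix B."""
    diag = np.diag(B)
    return (diag[:, None] + diag[None, :]) / 2.0 - B

# Hypothetical data: eight random points in R^3 with the Euclidean distance.
rng = np.random.default_rng(1)
pts = rng.normal(size=(8, 3))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

G = dual_cohesion(D)           # cohesion matrix d*
D_back = dual_distance(G)      # d**, expected to reproduce D
G_back = dual_cohesion(D_back) # cohesion of d**, expected to reproduce G
```

The zero-sum property (C2) can be observed as well: every row of `G` sums to zero up to round-off.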


B. The dual K-sets algorithm 

For the K-sets algorithm, we need to have a distance measure. In view of the duality theorem between a cohesion measure
and a distance measure, we propose the dual K-sets algorithm in Algorithm 3 that uses a cohesion measure. As before, for a
cohesion measure $\gamma(\cdot, \cdot)$ between two points, we define the cohesion measure between two sets $S_1$ and $S_2$ as

$$\gamma(S_1, S_2) = \sum_{x \in S_1} \sum_{y \in S_2} \gamma(x, y). \qquad (25)$$

Also, note from (16) that the triangular distance from a point $x$ to a set $S$ can be computed by using the cohesion measure as
follows:

$$\Delta(x, S) = \gamma(x, x) - \frac{2}{|S|}\gamma(\{x\}, S) + \frac{1}{|S|^2}\gamma(S, S). \qquad (26)$$

As a direct result of the duality theorem in Theorem 18 and the convergence result of the K-sets algorithm in Theorem 14,
we have the following convergence result for the dual K-sets algorithm.

Corollary 19 As in (13), we define the normalized modularity as $\sum_{k=1}^{K} \frac{1}{|S_k|} \gamma(S_k, S_k)$. For the dual K-sets algorithm, the
normalized modularity is increasing when there is a change, i.e., a point is moved from one set to another. Thus, the algorithm
converges to a local optimum of the normalized modularity. Moreover, for $K = 2$, the dual K-sets algorithm yields two clusters
when the algorithm converges.
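The only step of the dual algorithm that differs from the distance-based version is how the triangular distance is evaluated. The sketch below is a hedged illustration with hypothetical function names and test data. It computes $\Delta(x, S)$ from a cohesion matrix via $\gamma(x,x) - \frac{2}{|S|}\gamma(\{x\}, S) + \frac{1}{|S|^2}\gamma(S, S)$ and compares it with the distance-based value $2\bar{d}(\{x\}, S) - \bar{d}(S, S)$ when the cohesion matrix is the dual of a distance matrix.

```python
import numpy as np

def triangular_distance_from_cohesion(G, x, S):
    """Triangular distance from point index x to index set S,
    using only the cohesion matrix G."""
    S = np.asarray(S)
    return (G[x, x]
            - 2.0 * G[x, S].sum() / S.size
            + G[np.ix_(S, S)].sum() / S.size ** 2)

def triangular_distance_from_distance(D, x, S):
    """Same quantity from the distance matrix: 2 * avg d(x, S) - avg d(S, S)."""
    S = np.asarray(S)
    return 2.0 * D[x, S].mean() - D[np.ix_(S, S)].mean()

# Hypothetical data: a Euclidean distance matrix and its dual cohesion matrix.
rng = np.random.default_rng(2)
pts = rng.normal(size=(7, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
G = D.mean(axis=0, keepdims=True) + D.mean(axis=1, keepdims=True) - D.mean() - D

S = [0, 3, 5]
from_cohesion = [triangular_distance_from_cohesion(G, x, S) for x in range(7)]
from_distance = [triangular_distance_from_distance(D, x, S) for x in range(7)]
```

Plugging the cohesion-based function into the sequential sweep gives the dual algorithm; no distance matrix is needed once a valid cohesion matrix is available.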


C. Connections to the kernel K-means algorithm 

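In this section, we show the connection between the dual K-sets algorithm and the kernel K-means algorithm in the literature
(see e.g., [54]). Let us consider the $n \times n$ matrix $\Gamma = (\gamma_{i,j})$ with $\gamma_{i,j} = \gamma(x_i, x_j)$ being the cohesion measure between $x_i$ and $x_j$.
Call the matrix $\Gamma$ the cohesion matrix (corresponding to the cohesion measure $\gamma(\cdot, \cdot)$). Since $\gamma(x_i, x_j) = \gamma(x_j, x_i)$, the matrix
$\Gamma$ is symmetric and thus has real eigenvalues $\lambda_k$, $k = 1, 2, \ldots, n$. Let $I$ be the $n \times n$ identity matrix and $\sigma \ge -\min_{1 \le k \le n} \lambda_k$.
Then the matrix $\tilde{\Gamma} = \sigma I + \Gamma$ is positive semi-definite as its $n$ eigenvalues $\tilde{\lambda}_k = \sigma + \lambda_k$, $k = 1, 2, \ldots, n$, are all nonnegative.
Let $v_k = (v_{k,1}, v_{k,2}, \ldots, v_{k,n})^T$, $k = 1, 2, \ldots, n$, be the eigenvector of $\Gamma$ corresponding to the eigenvalue $\lambda_k$. Then $v_k$ is also
the eigenvector of $\tilde{\Gamma}$ corresponding to the eigenvalue $\tilde{\lambda}_k$. Thus, we can decompose the matrix $\tilde{\Gamma}$ as follows:

$$\tilde{\Gamma} = \sum_{k=1}^{n} \tilde{\lambda}_k v_k v_k^T, \qquad (27)$$

where $v_k^T$ is the transpose of $v_k$. Now we choose the mapping $\phi : \Omega \to \mathbb{R}^n$ as follows:

$$\phi(x_i) = \Big( \sqrt{\tilde{\lambda}_1}\, v_{1,i}, \sqrt{\tilde{\lambda}_2}\, v_{2,i}, \ldots, \sqrt{\tilde{\lambda}_n}\, v_{n,i} \Big)^T \qquad (28)$$

for $i = 1, 2, \ldots, n$. Note that

$$\phi(x_i)^T \cdot \phi(x_j) = \sum_{k=1}^{n} \tilde{\lambda}_k v_{k,i} v_{k,j} = (\tilde{\Gamma})_{i,j} = \sigma \delta_{i,j} + \gamma(x_i, x_j), \qquad (29)$$

where $\delta_{i,j} = 1$ if $i = j$ and 0 otherwise.

The “centroid” of a set $S$ can be represented by the corresponding centroid in $\mathbb{R}^n$, i.e.,

$$\frac{1}{|S|} \sum_{y \in S} \phi(y), \qquad (30)$$

and the square of the “distance” between a point $x$ and the “centroid” of a set $S$ is

$$\begin{aligned}
&\Big( \phi(x) - \frac{1}{|S|} \sum_{y \in S} \phi(y) \Big)^T \cdot \Big( \phi(x) - \frac{1}{|S|} \sum_{y \in S} \phi(y) \Big) \\
&\quad = \phi(x)^T \cdot \phi(x) - \frac{2}{|S|} \sum_{y \in S} \phi(x)^T \cdot \phi(y) + \frac{1}{|S|^2} \sum_{y_1 \in S} \sum_{y_2 \in S} \phi(y_1)^T \cdot \phi(y_2) \\
&\quad = \gamma(x, x) + \sigma - \frac{2}{|S|} \sum_{y \in S} \gamma(x, y) - \frac{2\sigma}{|S|} 1_{\{x \in S\}} + \frac{1}{|S|^2} \sum_{y_1 \in S} \sum_{y_2 \in S} \gamma(y_1, y_2) + \frac{\sigma}{|S|} \\
&\quad = \Big( 1 - \frac{2}{|S|} 1_{\{x \in S\}} + \frac{1}{|S|} \Big) \sigma + \gamma(x, x) - \frac{2}{|S|} \gamma(\{x\}, S) + \frac{1}{|S|^2} \gamma(S, S),
\end{aligned} \qquad (31)$$

where $1_{\{x \in S\}}$ is the indicator function that has value 1 if $x$ is in $S$ and 0 otherwise. In view of (16), we then have

$$\Big( \phi(x) - \frac{1}{|S|} \sum_{y \in S} \phi(y) \Big)^T \cdot \Big( \phi(x) - \frac{1}{|S|} \sum_{y \in S} \phi(y) \Big) = \Big( 1 - \frac{2}{|S|} 1_{\{x \in S\}} + \frac{1}{|S|} \Big) \sigma + \Delta(x, S), \qquad (32)$$

where $\Delta(x, S)$ is the triangular distance from a point $x$ to a set $S$. Thus, the square of the “distance” between a point $x$ and the
“centroid” of a set $S$ is $(1 - \frac{1}{|S|})\sigma + \Delta(x, S)$ for a point $x \in S$ and $(1 + \frac{1}{|S|})\sigma + \Delta(x, S)$ for a point $x \notin S$. In particular, when
$\sigma = 0$, the dual K-sets algorithm is the same as the sequential kernel K-means algorithm for the kernel $\Gamma$. Unfortunately, the
matrix $\Gamma$ may not be positive semi-definite if $\sigma$ is chosen to be 0. As indicated in [55], a large $\sigma$ decreases (resp. increases)
the distance from a point $x$ to a set $S$ that contains (resp. does not contain) that point. As such, a point is less likely to
move from one set to another set and the kernel K-means algorithm is thus more likely to be trapped in a local optimum.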
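The shift-and-embed construction can be traced numerically. The sketch below is a minimal illustration under stated assumptions: it reuses the 5 x 5 cohesion matrix of Example 20, picks the smallest admissible shift $\sigma$, builds the feature vectors from an eigendecomposition, and compares the squared distance to a centroid with $(1 - \frac{2}{|S|} 1_{\{x \in S\}} + \frac{1}{|S|})\sigma + \Delta(x, S)$. The helper names are hypothetical.

```python
import numpy as np

# Cohesion matrix of Example 20 (not positive semi-definite).
Gamma = np.array([
    [ 0.44,  0.04,  0.04,  0.04, -0.56],
    [ 0.04,  0.64, -0.36, -0.36,  0.04],
    [ 0.04, -0.36,  0.64, -0.36,  0.04],
    [ 0.04, -0.36, -0.36,  0.64,  0.04],
    [-0.56,  0.04,  0.04,  0.04,  0.44],
])

lam, V = np.linalg.eigh(Gamma)                  # ascending eigenvalues, eigenvectors in columns
sigma = max(0.0, -lam.min())                    # smallest shift making sigma*I + Gamma PSD
lam_shifted = np.clip(lam + sigma, 0.0, None)   # clip guards against tiny negative round-off
Phi = V * np.sqrt(lam_shifted)                  # row i is the feature vector phi(x_i)

def centroid_sq_distance(x, S):
    """Squared Euclidean distance from phi(x) to the centroid of {phi(y): y in S}."""
    diff = Phi[x] - Phi[S].mean(axis=0)
    return float(diff @ diff)

def triangular_distance(x, S):
    """Triangular distance computed from the cohesion matrix."""
    S = np.asarray(S)
    return Gamma[x, x] - 2.0 * Gamma[x, S].sum() / S.size + Gamma[np.ix_(S, S)].sum() / S.size ** 2

def shifted_triangular_distance(x, S):
    """Right-hand side of the identity: sigma-dependent offset plus triangular distance."""
    indicator = 1.0 if x in S else 0.0
    return (1.0 - 2.0 * indicator / len(S) + 1.0 / len(S)) * sigma + triangular_distance(x, S)
```

With this matrix the smallest admissible shift is about 0.2, so points inside and outside a set receive different offsets, which is the source of the difference between the kernel K-means update and the dual K-sets update.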

To summarize, the dual K-sets algorithm operates in the same way as a sequential version of the classical kernel K-means
algorithm by viewing the matrix $\Gamma$ as a kernel. However, there are two key differences between the dual K-sets algorithm and
the classical kernel K-means algorithm: (i) the dual K-sets algorithm guarantees the convergence even though the matrix $\Gamma$
from a cohesion measure is not positive semi-definite, and (ii) the dual K-sets algorithm can only be operated sequentially and
the kernel K-means algorithm can be operated in batches. To further illustrate the difference between these two algorithms,
we show in the following two examples that a cohesion matrix may not be positive semi-definite and a positive semi-definite
matrix may not be a cohesion matrix.


Example 20 In this example, we show there is a cohesion matrix $\Gamma$ that is not a positive semi-definite matrix.

$$\Gamma = \begin{pmatrix}
0.44 & 0.04 & 0.04 & 0.04 & -0.56 \\
0.04 & 0.64 & -0.36 & -0.36 & 0.04 \\
0.04 & -0.36 & 0.64 & -0.36 & 0.04 \\
0.04 & -0.36 & -0.36 & 0.64 & 0.04 \\
-0.56 & 0.04 & 0.04 & 0.04 & 0.44
\end{pmatrix} \qquad (33)$$

The eigenvalues of this matrix are $-0.2$, 0, 1, 1, and 1.
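The claims made about the two example matrices in this part of the section (the cohesion matrix above and the positive semi-definite matrix $M$ of Example 21) can be reproduced with a few lines of NumPy. The checker below is a sketch written for illustration; the function name and tolerance are assumptions. It tests symmetry, zero row sums and the triangular inequality $\beta(x,x) + \beta(y,z) - \beta(x,z) - \beta(x,y) \ge 0$ over all triples.

```python
import numpy as np

Gamma = np.array([
    [ 0.44,  0.04,  0.04,  0.04, -0.56],
    [ 0.04,  0.64, -0.36, -0.36,  0.04],
    [ 0.04, -0.36,  0.64, -0.36,  0.04],
    [ 0.04, -0.36, -0.36,  0.64,  0.04],
    [-0.56,  0.04,  0.04,  0.04,  0.44],
])

M = np.array([
    [ 0.375, -0.025, -0.325, -0.025],
    [-0.025,  0.875, -0.025, -0.825],
    [-0.325, -0.025,  0.375, -0.025],
    [-0.025, -0.825, -0.025,  0.875],
])

def is_cohesion_matrix(B, tol=1e-9):
    """Check (C1) symmetry, (C2) zero row sums and (C3) the triangular inequality."""
    symmetric = np.allclose(B, B.T, atol=tol)
    zero_sum = np.allclose(B.sum(axis=1), 0.0, atol=tol)
    diag = np.diag(B)
    # T[x, y, z] = B[x, x] + B[y, z] - B[x, z] - B[x, y]
    T = diag[:, None, None] + B[None, :, :] - B[:, None, :] - B[:, :, None]
    return bool(symmetric and zero_sum and T.min() >= -tol)

eig_Gamma = np.linalg.eigvalsh(Gamma)   # ascending order
eig_M = np.linalg.eigvalsh(M)
```

Running the checker on both matrices separates the two notions: one matrix passes the cohesion test but has a negative eigenvalue, the other has nonnegative eigenvalues but fails the triangular inequality.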


Example 21 In this example, we show there is a positive semi-definite matrix $M = (m_{i,j})$ that is not a cohesion matrix.

$$M = \begin{pmatrix}
0.375 & -0.025 & -0.325 & -0.025 \\
-0.025 & 0.875 & -0.025 & -0.825 \\
-0.325 & -0.025 & 0.375 & -0.025 \\
-0.025 & -0.825 & -0.025 & 0.875
\end{pmatrix} \qquad (34)$$

The eigenvalues of this matrix are 0, 0.1, 0.7, and 1.7. Even though the matrix $M$ is symmetric and has all its row sums and
column sums being 0, it is still not a cohesion matrix as $m_{1,1} - m_{1,2} - m_{1,4} + m_{2,4} = -0.4 < 0$.

D. Constructing a cohesion measure from a similarity measure

A similarity measure is in general defined as a bivariate function of two distinct data points and it is often characterized
by a square matrix without specifying the diagonal elements. In the following, we show how one can construct a cohesion
measure from a symmetric bivariate function by further specifying the diagonal elements. Its proof is given in Appendix I.

Proposition 22 Suppose a bivariate function $\beta_0 : \Omega \times \Omega \to \mathbb{R}$ is symmetric, i.e., $\beta_0(x, y) = \beta_0(y, x)$. Let $\beta_1(x, y) = \beta_0(x, y)$
for all $x \neq y$ and specify $\beta_1(x, x)$ such that

$$\beta_1(x, x) \ge \max_{x \neq y \neq z} \big[ \beta_1(x, z) + \beta_1(x, y) - \beta_1(y, z) \big]. \qquad (35)$$

Also, let

$$\beta(x, y) = \beta_1(x, y) - \frac{1}{n} \sum_{z_1 \in \Omega} \beta_1(z_1, y) - \frac{1}{n} \sum_{z_2 \in \Omega} \beta_1(x, z_2) + \frac{1}{n^2} \sum_{z_1 \in \Omega} \sum_{z_2 \in \Omega} \beta_1(z_1, z_2). \qquad (36)$$

Then $\beta(x, y)$ is a cohesion measure for $\Omega$.

We note that one simple choice for specifying $\beta_1(x, x)$ in (35) is to set

$$\beta_1(x, x) = 2\beta_{\max} - \beta_{\min}, \qquad (37)$$

where

$$\beta_{\max} = \max_{x \neq y} \beta_0(x, y), \qquad (38)$$

and

$$\beta_{\min} = \min_{x \neq y} \beta_0(x, y). \qquad (39)$$

In particular, if the similarity measure $\beta_0(x, y)$ only has values 0 and 1 as in the adjacency matrix of a simple undirected graph,
then one can simply choose $\beta_1(x, x) = 2$ for all $x$.


Example 23 (A cohesion measure for a graph) As an illustrating example, suppose $A = (a_{i,j})$ is the $n \times n$ adjacency
matrix of a simple undirected graph with $a_{i,j} = 1$ if there is an edge between node $i$ and node $j$ and 0 otherwise. Let
$k_i = \sum_{j=1}^{n} a_{i,j}$ be the degree of node $i$ and $m = \frac{1}{2} \sum_{i=1}^{n} k_i$ be the total number of edges in the graph. Then one can simply
let $\beta_1(i, j) = 2\delta_{i,j} + a_{i,j}$, where $\delta_{i,j} = 1$ if $i = j$ and 0 otherwise. By doing so, we then have the following cohesion measure

$$\beta(i, j) = 2\delta_{i,j} + a_{i,j} - \frac{2 + k_i}{n} - \frac{2 + k_j}{n} + \frac{2m + 2n}{n^2}. \qquad (40)$$

We note that such a cohesion measure is known in the literature as the deviation to indetermination null model.
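The graph construction can be sketched as follows. The code starts from $\beta_1(i,j) = 2\delta_{i,j} + a_{i,j}$, subtracts the row and column averages and adds back the overall average, and compares the result with the closed form obtained by substituting the degrees $k_i$ and the edge count $m$, namely $2\delta_{i,j} + a_{i,j} - (2 + k_i)/n - (2 + k_j)/n + (2m + 2n)/n^2$. The function names and the 4-node path graph are illustrative assumptions.

```python
import numpy as np

def graph_cohesion(A):
    """Cohesion matrix built from an adjacency matrix by setting the
    diagonal of the similarity to 2 and then double centering."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    B1 = 2.0 * np.eye(n) + A                     # beta_1(i, j) = 2*delta_ij + a_ij
    row = B1.mean(axis=1, keepdims=True)
    col = B1.mean(axis=0, keepdims=True)
    return B1 - col - row + B1.mean()

def graph_cohesion_closed_form(A):
    """Same matrix written with the degrees k_i and the number of edges m."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    k = A.sum(axis=1)                            # node degrees
    m = A.sum() / 2.0                            # number of edges
    return (2.0 * np.eye(n) + A
            - (2.0 + k)[:, None] / n
            - (2.0 + k)[None, :] / n
            + (2.0 * m + 2.0 * n) / n ** 2)

def satisfies_cohesion_axioms(B, tol=1e-9):
    """Symmetry, zero row sums, and the triangular inequality over all triples."""
    diag = np.diag(B)
    T = diag[:, None, None] + B[None, :, :] - B[:, None, :] - B[:, :, None]
    return bool(np.allclose(B, B.T, atol=tol)
                and np.allclose(B.sum(axis=1), 0.0, atol=tol)
                and T.min() >= -tol)

# Hypothetical example: path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
B = graph_cohesion(A)
```

The resulting matrix can be passed directly to a cohesion-based clustering routine, so a graph can be clustered without first computing shortest-path or resistance distances.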


VI. Conclusions 

In this paper, we developed a mathematical theory for clustering in metric spaces based on distance measures and cohesion
measures. A cluster is defined as a set of data points that are cohesive to themselves. The hierarchical agglomerative algorithm
in Algorithm 1 was shown to converge with a partition of clusters. Our hierarchical agglomerative algorithm differs from a
standard hierarchical agglomerative algorithm in two aspects: (i) there is a stopping criterion for our algorithm, and (ii) there is
no need to use the greedy selection. We also proposed the K-sets algorithm in Algorithm 2 based on the concept of triangular
distance. Such an algorithm appears to be new. Unlike the Lloyd iteration, it only takes one-step minimization in each iteration
and that might give the K-sets algorithm a computational advantage over the K-medoids algorithms. The K-sets algorithm
was shown to converge with a partition of two clusters when $K = 2$. Another interesting finding of the paper is the duality
result between a distance measure and a cohesion measure. As such, one can perform clustering either by a distance measure
or a cohesion measure. In particular, the dual K-sets algorithm in Algorithm 3 converges in the same way as a sequential
version of the kernel K-means algorithm without the need for the cohesion matrix to be positive semi-definite.

There are several possible extensions for our work:

(i) Asymmetric distance measure: One possible extension is to remove the symmetric property in (D3) for a distance measure.
Our preliminary result shows that one only needs $d(x, x) = 0$ in (D2) and the triangular inequality in (D4) for the K-sets
algorithm to converge. The key insight for this is that one can replace the original distance measure $d(x, y)$ by a new distance
measure $\tilde{d}(x, y) = d(x, y) + d(y, x)$. By doing so, the new distance measure is symmetric.






(ii) Distance measure without the triangular inequality: Another possible extension is to remove the triangular inequality in
(D4). However, the K-sets algorithm does not work properly in this setting as the triangular distance is no longer nonnegative.
In order for the K-sets algorithm to converge, our preliminary result shows that one can adjust the value of the triangular
distance based on a weaker notion of cohesion measure. Results along this line will be reported separately.

(iii) Performance guarantee: Like the K-means algorithm, the output of the K-sets algorithm also depends on the initial
partition. It would be of interest to see if it is possible to derive a performance guarantee for the K-sets algorithm (or the
optimization problem for the normalized modularity). In particular, the approach by approximation stability in [21] might be
applicable as their threshold graph lemma seems to hold when one replaces the distance from a point $x$ to its center $c$, i.e.,
$d(x, c)$, by the average distance of a point $x$ to its set, i.e., $\bar{d}(x, S)$.

(iv) Local clustering: The problem of local clustering is to find a cluster that contains a specific point x. Since we already 
define what a cluster is, we may use the hierarchical agglomerative algorithm in Algorithm 1 to find a cluster that contains $x$.
One potential problem of such an approach is the output cluster might be very big. Analogous to the concept of community 
strength in O, it would be of interest to define a concept of cluster strength and stop the agglomerative process when the 
desired cluster strength can no longer be met. 

(v) Reduction of computational complexity: Note that the computational complexity for each iteration within the FOR loop of
the K-sets algorithm is $O(Kn^2)$ as it takes $O(Kn)$ steps to compute the triangular distance for each point and there are $n$
points that need to be assigned in each iteration. To further reduce the computational complexity for such an algorithm, one
might exploit the idea of “sparsity” and this can be done by the transformation of distance measure.

Appendix A

In this section, we prove Proposition 01

(i) Since the distance measure is symmetric, we have from Q that 'y{x,y) = 'j{y,x). Thus, the cohesion measure 

between two points is symmetric. 

(ii) We note from (6) that

$$\gamma(x, x) = \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} \big( d(x, z_1) + d(z_2, x) - d(z_1, z_2) - d(x, x) \big). \qquad (41)$$

Since $d(x, x) = 0$, we have from the triangular inequality that $\gamma(x, x) \ge 0$.

(iii) Note from (HTt and (l6]l that 

$$\gamma(x, x) - \gamma(x, y) = \frac{1}{n} \sum_{z_2 \in \Omega} \big( d(z_2, x) - d(x, x) + d(x, y) - d(z_2, y) \big).$$

Since $d(x, x) = 0$, we have from the triangular inequality that

$$\gamma(x, x) \ge \gamma(x, y).$$

(iv) This can be easily verihed by summing y in Q- 


Appendix B 

In this section, we prove Theorem |7] 

We first list several properties for the average distance that will be used in the proof of Theorem |7] 


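Fact 24 (i) $\bar{d}(S_1, S_2) \ge 0$;

(ii) (Symmetry) $\bar{d}(S_1, S_2) = \bar{d}(S_2, S_1)$;

(iii) (Weighted average) Suppose that $S_2$ and $S_3$ are two disjoint subsets of $\Omega$. Then

$$\bar{d}(S_1, S_2 \cup S_3) = \frac{|S_2|}{|S_2| + |S_3|}\, \bar{d}(S_1, S_2) + \frac{|S_3|}{|S_2| + |S_3|}\, \bar{d}(S_1, S_3). \qquad (42)$$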


Now we can use the average distance to represent the relative distance and the cohesion measure. 

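Fact 25 (i) From the definition of the cohesion measure, one can represent the cohesion measure between two points in terms of the average distance as follows:

$$\gamma(x, y) = \bar{d}(\Omega, y) + \bar{d}(x, \Omega) - \bar{d}(\Omega, \Omega) - \bar{d}(x, y). \tag{43}$$

(ii) From the definition of the cohesion measure between two sets and (43), one can represent the cohesion measure between two sets in terms of the average distance as follows:

$$\gamma(S_1, S_2) = |S_1| \cdot |S_2| \cdot \big( \bar{d}(\Omega, S_2) + \bar{d}(S_1, \Omega) - \bar{d}(\Omega, \Omega) - \bar{d}(S_1, S_2) \big). \tag{44}$$

(iii) From the definition of the relative distance, one can represent the relative distance from $x$ to $y$ in terms of the average distance as follows:

$$\mathrm{RD}(x \| y) = \bar{d}(x, y) - \bar{d}(x, \Omega). \tag{45}$$

(iv) From (45) and the definition of the relative distance between two sets, one can represent the relative distance from a set $S_1$ to another set $S_2$ in terms of the average distance as follows:

$$\mathrm{RD}(S_1 \| S_2) = \bar{d}(S_1, S_2) - \bar{d}(S_1, \Omega). \tag{46}$$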

Using the representation of the cohesion measure by the average distance, we show some properties for the cohesion measure 
between two sets in the following proposition. 
Proposition 26 (i) (Symmetry) For any two sets $S_1$ and $S_2$,

$$\gamma(S_1, S_2) = \gamma(S_2, S_1). \tag{47}$$

(ii) (Zero-sum) Any set $S$ is both cohesive and incohesive to $\Omega$, i.e., $\gamma(\Omega, S) = \gamma(S, \Omega) = 0$.

(iii) (Union of two disjoint sets) Suppose that $S_2$ and $S_3$ are two disjoint subsets of $\Omega$. Then for any set $S_1$,

$$\gamma(S_1, S_2 \cup S_3) = \gamma(S_1, S_2) + \gamma(S_1, S_3). \tag{48}$$

(iv) (Union of two disjoint sets) Suppose that $S_2$ and $S_3$ are two disjoint subsets of $\Omega$. Then for any set $S_1$,

$$\gamma(S_2 \cup S_3, S_1) = \gamma(S_2, S_1) + \gamma(S_3, S_1). \tag{49}$$

(v) Suppose that $S$ is a nonempty set and it is not equal to $\Omega$. Let $S^c = \Omega \setminus S$ be the set of points that are not in $S$. Then

$$\gamma(S, S) = -\gamma(S^c, S) = \gamma(S^c, S^c). \tag{50}$$

Proof. (i) The symmetric property follows from the symmetric property of $\gamma(x, y)$.

(ii) That $\gamma(\Omega, S) = 0$ follows trivially from (44).

(iii) This is a direct consequence of the definition of the cohesion measure between two sets.

(iv) That (49) holds follows from the symmetric property in (i) and the identity in (48).

(v) Note from (ii) of this proposition that $\gamma(\Omega, S) = \gamma(S \cup S^c, S) = 0$. Using (49) yields

$$\gamma(S, S) + \gamma(S^c, S) = 0.$$

Thus, we have

$$\gamma(S, S) = -\gamma(S^c, S). \tag{51}$$

From (51), applied to $S^c$, we also have

$$\gamma(S^c, S^c) = -\gamma(S, S^c). \tag{52}$$

From the symmetric property in (i) of this proposition, we have $\gamma(S^c, S) = \gamma(S, S^c)$. As a result of (51) and (52), we then have

$$\gamma(S, S) = \gamma(S^c, S^c).$$

■

Using the representations of the relative distance and the cohesion measure by the average distance, we show some properties 
for the relative distance in the following proposition. 

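Proposition 27 (i) (Zero-sum) The relative distance from any set $S$ to $\Omega$ is 0, i.e., $\mathrm{RD}(S \| \Omega) = 0$.

(ii) (Union of two disjoint sets) Suppose that $S_2$ and $S_3$ are two disjoint subsets of $\Omega$. Then for any set $S_1$,

$$\mathrm{RD}(S_1 \| S_2 \cup S_3) = \frac{|S_2|}{|S_2| + |S_3|}\, \mathrm{RD}(S_1 \| S_2) + \frac{|S_3|}{|S_2| + |S_3|}\, \mathrm{RD}(S_1 \| S_3). \tag{53}$$

(iii) (Union of two disjoint sets) Suppose that $S_2$ and $S_3$ are two disjoint subsets of $\Omega$. Then for any set $S_1$,

$$\mathrm{RD}(S_2 \cup S_3 \| S_1) = \frac{|S_2|}{|S_2| + |S_3|}\, \mathrm{RD}(S_2 \| S_1) + \frac{|S_3|}{|S_2| + |S_3|}\, \mathrm{RD}(S_3 \| S_1). \tag{54}$$

(iv) (Reciprocity) For any two sets $S_1$ and $S_2$,

$$\mathrm{RD}(\Omega \| S_2) - \mathrm{RD}(S_1 \| S_2) = \mathrm{RD}(\Omega \| S_1) - \mathrm{RD}(S_2 \| S_1). \tag{55}$$

(v) The cohesion measure can be represented by the relative distance as follows:

$$\gamma(S_1, S_2) = |S_1| \cdot |S_2| \cdot \big( \mathrm{RD}(\Omega \| S_2) - \mathrm{RD}(S_1 \| S_2) \big). \tag{56}$$

Proof. (i) This is trivial from (46).

(ii) and (iii) follow from (42) and (46).
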
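(iv) Note from the symmetric property of $\bar{d}(\cdot, \cdot)$ that

$$\begin{aligned}
\mathrm{RD}(\Omega \| S_2) - \mathrm{RD}(S_1 \| S_2)
&= \bar{d}(\Omega, S_2) - \bar{d}(\Omega, \Omega) - \big( \bar{d}(S_1, S_2) - \bar{d}(S_1, \Omega) \big) \\
&= \bar{d}(\Omega, S_1) - \bar{d}(\Omega, \Omega) - \big( \bar{d}(S_2, S_1) - \bar{d}(S_2, \Omega) \big) \\
&= \mathrm{RD}(\Omega \| S_1) - \mathrm{RD}(S_2 \| S_1).
\end{aligned}$$

(v) This is a direct consequence of (44) and (46).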

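Proof. (Theorem 7) (i) ⇒ (ii): If $\gamma(S, S) \ge 0$, we then have from (50) that $\gamma(S^c, S^c) = \gamma(S, S) \ge 0$.

(ii) ⇒ (iii): If $\gamma(S^c, S^c) \ge 0$, then it follows from (50) that $\gamma(S^c, S) = -\gamma(S^c, S^c) \le 0$.

(iii) ⇒ (iv): If $\gamma(S^c, S) \le 0$, then we have from (50) that $\gamma(S, S) = -\gamma(S^c, S) \ge 0$. Since $\gamma(S, S^c) = \gamma(S^c, S) \le 0$ by symmetry, we have $\gamma(S, S) \ge \gamma(S, S^c)$.

(iv) ⇒ (v): If $\gamma(S, S) \ge \gamma(S, S^c)$, then it follows from (50) that

$$\gamma(S, S) \ge \gamma(S, S^c) = -\gamma(S, S).$$

This then leads to $\gamma(S, S) \ge 0$. From (44), we know that

$$\gamma(S, S) = |S|^2 \cdot \big( 2\bar{d}(S, \Omega) - \bar{d}(\Omega, \Omega) - \bar{d}(S, S) \big) \ge 0.$$

Thus, $2\bar{d}(S, \Omega) - \bar{d}(\Omega, \Omega) - \bar{d}(S, S) \ge 0$.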

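(v) ⇒ (vi): Note from (46) that

$$\begin{aligned}
\mathrm{RD}(\Omega \| S) - \mathrm{RD}(S \| S)
&= \bar{d}(\Omega, S) - \bar{d}(\Omega, \Omega) - \big( \bar{d}(S, S) - \bar{d}(S, \Omega) \big) \\
&= 2\bar{d}(S, \Omega) - \bar{d}(\Omega, \Omega) - \bar{d}(S, S).
\end{aligned}$$

Thus, if $2\bar{d}(S, \Omega) - \bar{d}(\Omega, \Omega) - \bar{d}(S, S) \ge 0$, then $\mathrm{RD}(\Omega \| S) \ge \mathrm{RD}(S \| S)$.

(vi) ⇒ (vii): From (54) in Proposition 27(iii), we have

$$\mathrm{RD}(\Omega \| S) = \mathrm{RD}(S \cup S^c \| S) = \frac{|S|}{n}\, \mathrm{RD}(S \| S) + \frac{|S^c|}{n}\, \mathrm{RD}(S^c \| S).$$

Thus,

$$\mathrm{RD}(\Omega \| S) - \mathrm{RD}(S \| S) = \frac{|S^c|}{n} \big( \mathrm{RD}(S^c \| S) - \mathrm{RD}(S \| S) \big).$$

Clearly, if $\mathrm{RD}(\Omega \| S) \ge \mathrm{RD}(S \| S)$, then $\mathrm{RD}(S^c \| S) \ge \mathrm{RD}(S \| S)$.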

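(vii) ⇒ (viii): Note from (42) and (46) that

$$\begin{aligned}
\mathrm{RD}(S^c \| S) - \mathrm{RD}(S \| S)
&= \bar{d}(S^c, S) - \bar{d}(S^c, \Omega) - \big( \bar{d}(S, S) - \bar{d}(S, \Omega) \big) \\
&= \bar{d}(S^c, S) - \bar{d}(S^c, S \cup S^c) - \bar{d}(S, S) + \bar{d}(S, S \cup S^c) \\
&= \frac{|S^c|}{n} \big( 2\bar{d}(S, S^c) - \bar{d}(S, S) - \bar{d}(S^c, S^c) \big).
\end{aligned}$$

Thus, if $\mathrm{RD}(S^c \| S) \ge \mathrm{RD}(S \| S)$, then $2\bar{d}(S, S^c) - \bar{d}(S, S) - \bar{d}(S^c, S^c) \ge 0$.

(viii) ⇒ (ix): Note from (42) and (46) that

$$\begin{aligned}
\mathrm{RD}(S \| S^c) - \mathrm{RD}(\Omega \| S^c)
&= \bar{d}(S, S^c) - \bar{d}(S, \Omega) - \big( \bar{d}(\Omega, S^c) - \bar{d}(\Omega, \Omega) \big) \\
&= \bar{d}(S, S^c) - \bar{d}(S, S \cup S^c) - \bar{d}(S \cup S^c, S^c) + \bar{d}(S \cup S^c, S \cup S^c) \\
&= \frac{|S| \cdot |S^c|}{n^2} \big( 2\bar{d}(S, S^c) - \bar{d}(S, S) - \bar{d}(S^c, S^c) \big).
\end{aligned}$$

Thus, if $2\bar{d}(S, S^c) - \bar{d}(S, S) - \bar{d}(S^c, S^c) \ge 0$, then $\mathrm{RD}(S \| S^c) \ge \mathrm{RD}(\Omega \| S^c)$.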

(ix) ⇒ (x): Note from (55) in Proposition 27(iv) that

$$\mathrm{RD}(S^c \| S) - \mathrm{RD}(\Omega \| S) = \mathrm{RD}(S \| S^c) - \mathrm{RD}(\Omega \| S^c).$$

Thus, if $\mathrm{RD}(S \| S^c) \ge \mathrm{RD}(\Omega \| S^c)$, then $\mathrm{RD}(S^c \| S) \ge \mathrm{RD}(\Omega \| S)$.

(x) ⇒ (i): Note from (50) and (56) that

$$\gamma(S, S) = -\gamma(S^c, S) = -|S^c| \cdot |S| \cdot \big( \mathrm{RD}(\Omega \| S) - \mathrm{RD}(S^c \| S) \big).$$

Thus, if $\mathrm{RD}(S^c \| S) \ge \mathrm{RD}(\Omega \| S)$, then $\gamma(S, S) \ge 0$.
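
The chain of implications can be sanity-checked on a small example. The following sketch is an illustration only: it assumes that the cohesion measure between two sets is the sum of the pointwise cohesion measures over all pairs (consistent with the factor $|S_1| \cdot |S_2|$ in (44)), and the data and helper names are hypothetical. It evaluates the quantities in statements (i), (ii), (iii) and (viii) for one candidate set.

```python
def cohesion_matrix(d):
    """Pointwise cohesion: column mean + row mean - grand mean - d."""
    n = len(d)
    row_mean = [sum(d[x][z] for z in range(n)) / n for x in range(n)]
    col_mean = [sum(d[z][y] for z in range(n)) / n for y in range(n)]
    grand_mean = sum(row_mean) / n
    return [[col_mean[y] + row_mean[x] - grand_mean - d[x][y] for y in range(n)]
            for x in range(n)]


def gamma_sets(gamma, A, B):
    """Cohesion between two index sets, assumed to be the sum over all pairs."""
    return sum(gamma[a][b] for a in A for b in B)


def avg_dist(d, A, B):
    """Average distance between two index sets."""
    return sum(d[a][b] for a in A for b in B) / (len(A) * len(B))


# Hypothetical data: six points on a line; S is the left group.
pos = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in pos] for p in pos]
gamma = cohesion_matrix(d)
S, Sc = [0, 1, 2], [3, 4, 5]

g_ss = gamma_sets(gamma, S, S)     # statement (i):   gamma(S, S) >= 0
g_cc = gamma_sets(gamma, Sc, Sc)   # statement (ii):  gamma(S^c, S^c) >= 0
g_cs = gamma_sets(gamma, Sc, S)    # statement (iii): gamma(S^c, S) <= 0
gap = 2 * avg_dist(d, S, Sc) - avg_dist(d, S, S) - avg_dist(d, Sc, Sc)  # (viii)
print(g_ss, g_cc, g_cs, gap)
```

On this example the first two values coincide, the third is their negative, as (50) requires, and the last value is positive, so all four statements agree that the left group is a cluster.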


Appendix C 

In this section, we prove Theorem 9.

(i) We prove this by induction. Since $\gamma(x, x) \ge 0$, every point is a cluster by itself. Thus, all the initial $n$ sets are disjoint clusters. Assume that all the remaining sets are clusters as our induction hypothesis. In each iteration, we merge two disjoint cohesive clusters. Suppose that $S_i$ and $S_j$ are merged to form $S_k$. It then follows from (48) and (49) in Proposition 26(iii) and (iv) that

$$\gamma(S_k, S_k) = \gamma(S_i, S_i) + 2\gamma(S_i, S_j) + \gamma(S_j, S_j). \tag{60}$$

As both $S_i$ and $S_j$ are clusters from our induction hypothesis, we have $\gamma(S_i, S_i) \ge 0$ and $\gamma(S_j, S_j) \ge 0$. Also, since $S_i$ and $S_j$ are cohesive, i.e., $\gamma(S_i, S_j) \ge 0$, we then have from (60) that $\gamma(S_k, S_k) \ge 0$ and the set $S_k$ is also a cluster.

(ii) To see that the modularity index is non-decreasing in every iteration, note from (60) and $\gamma(S_i, S_j) \ge 0$ that

$$\gamma(S_k, S_k) \ge \gamma(S_i, S_i) + \gamma(S_j, S_j).$$

As such, the algorithm converges to a local optimum.
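
The induction argument can be illustrated with a small agglomerative sketch. This is a hedged example rather than the algorithm as specified in the paper: it assumes that in each iteration the pair of sets with the largest positive cohesion measure is merged and that the process stops when no pair has positive cohesion; the greedy selection rule, the threshold `eps` and all function names are assumptions of the sketch.

```python
def cohesion_matrix(d):
    """Pointwise cohesion: column mean + row mean - grand mean - d."""
    n = len(d)
    row_mean = [sum(d[x][z] for z in range(n)) / n for x in range(n)]
    col_mean = [sum(d[z][y] for z in range(n)) / n for y in range(n)]
    grand_mean = sum(row_mean) / n
    return [[col_mean[y] + row_mean[x] - grand_mean - d[x][y] for y in range(n)]
            for x in range(n)]


def gamma_sets(gamma, A, B):
    """Cohesion between two index sets, assumed to be the sum over all pairs."""
    return sum(gamma[a][b] for a in A for b in B)


def agglomerate(d, eps=1e-12):
    """Merge the most cohesive pair until no pair has positive cohesion."""
    gamma = cohesion_matrix(d)
    sets = [[x] for x in range(len(d))]
    # Modularity after each step: sum of gamma(S, S) over the current sets.
    history = [sum(gamma_sets(gamma, S, S) for S in sets)]
    while len(sets) > 1:
        best, pair = eps, None
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                g = gamma_sets(gamma, sets[i], sets[j])
                if g > best:
                    best, pair = g, (i, j)
        if pair is None:
            break  # stopping criterion of this sketch: no cohesive pair left
        i, j = pair
        merged = sets[i] + sets[j]
        sets = [S for k, S in enumerate(sets) if k not in (i, j)] + [merged]
        history.append(sum(gamma_sets(gamma, S, S) for S in sets))
    return sets, history


# Hypothetical data: six points on a line forming two groups.
pos = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in pos] for p in pos]
sets, history = agglomerate(d)
print(sorted(sorted(S) for S in sets))  # -> [[0, 1, 2], [3, 4, 5]]
```

The list `history` records the modularity after every merge, so the non-decreasing property in (ii) can be inspected directly.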


Appendix D 

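In this appendix, we prove Lemma 13.

(i) From the triangular inequality, it is easy to see from the definition of the triangular distance that $\Delta(x, S) \ge 0$. Note that

$$\frac{1}{|S|^2} \sum_{z_1 \in S} \sum_{z_2 \in S} d(x, z_1) = \frac{1}{|S|} \sum_{z_1 \in S} d(x, z_1) = \bar{d}(\{x\}, S).$$

Similarly,

$$\frac{1}{|S|^2} \sum_{z_1 \in S} \sum_{z_2 \in S} d(x, z_2) = \bar{d}(\{x\}, S).$$

Thus, the triangular distance can also be written as $\Delta(x, S) = 2\bar{d}(\{x\}, S) - \bar{d}(S, S)$.

(ii) Recall from (44) that

$$\gamma(S_1, S_2) = |S_1| \cdot |S_2| \cdot \big( \bar{d}(\Omega, S_2) + \bar{d}(S_1, \Omega) - \bar{d}(\Omega, \Omega) - \bar{d}(S_1, S_2) \big).$$

Thus,

$$\begin{aligned}
\gamma(x, x) - \frac{2}{|S|} \gamma(\{x\}, S) + \frac{1}{|S|^2} \gamma(S, S)
&= 2\bar{d}(\{x\}, \Omega) - \bar{d}(\Omega, \Omega) \\
&\quad - 2 \big( \bar{d}(\{x\}, \Omega) + \bar{d}(S, \Omega) - \bar{d}(\Omega, \Omega) - \bar{d}(\{x\}, S) \big) \\
&\quad + \big( 2\bar{d}(\Omega, S) - \bar{d}(\Omega, \Omega) - \bar{d}(S, S) \big) \\
&= 2\bar{d}(\{x\}, S) - \bar{d}(S, S) = \Delta(x, S),
\end{aligned}$$

where we use (i) of the lemma in the last equality.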




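(iii) Note from (ii) that

$$\begin{aligned}
\sum_{k=1}^{K} \sum_{x \in S_k} \Delta(x, S_k)
&= \sum_{k=1}^{K} \sum_{x \in S_k} \Big( \gamma(x, x) - \frac{2}{|S_k|} \gamma(\{x\}, S_k) + \frac{1}{|S_k|^2} \gamma(S_k, S_k) \Big) \\
&= \sum_{k=1}^{K} \sum_{x \in S_k} \gamma(x, x) - \sum_{k=1}^{K} \frac{1}{|S_k|} \gamma(S_k, S_k) \\
&= \sum_{x \in \Omega} \gamma(x, x) - R.
\end{aligned}$$

(iv) From (i) of this lemma, we have

$$\sum_{k=1}^{K} \sum_{x \in S_k} \Delta(x, S_k) = \sum_{k=1}^{K} \sum_{x \in S_k} \big( 2\bar{d}(\{x\}, S_k) - \bar{d}(S_k, S_k) \big).$$

Observe that

$$\sum_{x \in S_k} \bar{d}(\{x\}, S_k) = |S_k|\, \bar{d}(S_k, S_k).$$

Thus,

$$\begin{aligned}
\sum_{k=1}^{K} \sum_{x \in S_k} \Delta(x, S_k)
&= \sum_{k=1}^{K} |S_k| \cdot \big( 2\bar{d}(S_k, S_k) - \bar{d}(S_k, S_k) \big) \\
&= \sum_{k=1}^{K} \sum_{x \in S_k} \bar{d}(\{x\}, S_k) \\
&= \sum_{x \in \Omega} \bar{d}(\{x\}, S_{c(x)}),
\end{aligned}$$

where $c(x)$ is the index of the set to which $x$ belongs.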
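
The identities in parts (i) to (iii) lend themselves to a direct numerical check. The sketch below is illustrative and rests on two assumptions that are not restated in this appendix: the triangular distance is taken to be $\Delta(x, S) = \frac{1}{|S|^2} \sum_{z_1 \in S} \sum_{z_2 \in S} \big( d(x, z_1) + d(x, z_2) - d(z_1, z_2) \big)$, which is the form consistent with step (i), and $R$ is taken to be $\sum_k \gamma(S_k, S_k) / |S_k|$. All function names and the sample data are hypothetical.

```python
def avg_dist(d, A, B):
    """Average distance between two index sets."""
    return sum(d[a][b] for a in A for b in B) / (len(A) * len(B))


def cohesion_matrix(d):
    """Pointwise cohesion: column mean + row mean - grand mean - d."""
    n = len(d)
    row_mean = [sum(d[x][z] for z in range(n)) / n for x in range(n)]
    col_mean = [sum(d[z][y] for z in range(n)) / n for y in range(n)]
    grand_mean = sum(row_mean) / n
    return [[col_mean[y] + row_mean[x] - grand_mean - d[x][y] for y in range(n)]
            for x in range(n)]


def gamma_sets(gamma, A, B):
    """Cohesion between two index sets, assumed to be the sum over all pairs."""
    return sum(gamma[a][b] for a in A for b in B)


def delta_definition(d, x, S):
    """Triangular distance as a double sum (assumed form of the definition)."""
    return sum(d[x][z1] + d[x][z2] - d[z1][z2] for z1 in S for z2 in S) / len(S) ** 2


def delta_average(d, x, S):
    """Part (i): 2 * avg d({x}, S) - avg d(S, S)."""
    return 2 * avg_dist(d, [x], S) - avg_dist(d, S, S)


def delta_cohesion(gamma, x, S):
    """Part (ii): gamma(x, x) - (2/|S|) gamma({x}, S) + (1/|S|^2) gamma(S, S)."""
    m = len(S)
    return gamma[x][x] - 2 * gamma_sets(gamma, [x], S) / m + gamma_sets(gamma, S, S) / m ** 2


# Hypothetical data: seven points on a line and an arbitrary partition into two sets.
pos = [0.0, 1.5, 2.0, 4.0, 9.0, 10.0, 13.0]
d = [[abs(p - q) for q in pos] for p in pos]
gamma = cohesion_matrix(d)
partition = [[0, 1, 4], [2, 3, 5, 6]]

total_delta = sum(delta_definition(d, x, S) for S in partition for x in S)
R = sum(gamma_sets(gamma, S, S) / len(S) for S in partition)
trace = sum(gamma[x][x] for x in range(len(pos)))
print(abs(total_delta - (trace - R)) < 1e-9)
```

The three `delta_*` functions should agree for every point and every set, including sets that do not contain the point; the final line checks the identity in part (iii) for the chosen partition.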

Appendix E 

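In this section, we prove Theorem 14. For this, we need to prove the following two inequalities.

Lemma 28 For any set $S$ and any point $x$ that is not in $S$,

$$\sum_{y \in S \cup \{x\}} \Delta(y, S \cup \{x\}) \le \sum_{y \in S \cup \{x\}} \Delta(y, S), \tag{62}$$

and

$$\sum_{y \in S} \Delta(y, S) \le \sum_{y \in S} \Delta(y, S \cup \{x\}). \tag{63}$$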

Proof. We first show that for any set $S$ and any point $x$ that is not in $S$,

$$\bar{d}(S \cup \{x\}, S \cup \{x\}) - 2\bar{d}(S \cup \{x\}, S) + \bar{d}(S, S) \le 0. \tag{64}$$

From the symmetric property in Fact 24(ii) and the weighted average property in Fact 24(iii), we have

$$\bar{d}(S \cup \{x\}, S \cup \{x\}) = \frac{|S|^2}{(|S| + 1)^2}\, \bar{d}(S, S) + \frac{2|S|}{(|S| + 1)^2}\, \bar{d}(\{x\}, S) + \frac{1}{(|S| + 1)^2}\, \bar{d}(\{x\}, \{x\}),$$

and

$$\bar{d}(S \cup \{x\}, S) = \frac{|S|}{|S| + 1}\, \bar{d}(S, S) + \frac{1}{|S| + 1}\, \bar{d}(\{x\}, S).$$

Note that

$$\bar{d}(\{x\}, \{x\}) = d(x, x) = 0.$$

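Thus,

$$\bar{d}(S \cup \{x\}, S \cup \{x\}) - 2\bar{d}(S \cup \{x\}, S) + \bar{d}(S, S) = \frac{1}{(|S| + 1)^2} \big( \bar{d}(S, S) - 2\bar{d}(\{x\}, S) \big) \le 0,$$

where we use Lemma 13(i) in the last inequality.

Note from Lemma 13(i) that

$$\sum_{y \in S_2} \Delta(y, S_1) = \sum_{y \in S_2} \big( 2\bar{d}(\{y\}, S_1) - \bar{d}(S_1, S_1) \big) = |S_2| \cdot \big( 2\bar{d}(S_1, S_2) - \bar{d}(S_1, S_1) \big). \tag{65}$$

Using (65) yields

$$\begin{aligned}
\sum_{y \in S \cup \{x\}} \Delta(y, S \cup \{x\}) - \sum_{y \in S \cup \{x\}} \Delta(y, S)
&= |S \cup \{x\}| \cdot \big( 2\bar{d}(S \cup \{x\}, S \cup \{x\}) - \bar{d}(S \cup \{x\}, S \cup \{x\}) \big) \\
&\quad - |S \cup \{x\}| \cdot \big( 2\bar{d}(S, S \cup \{x\}) - \bar{d}(S, S) \big) \\
&= |S \cup \{x\}| \cdot \big( \bar{d}(S \cup \{x\}, S \cup \{x\}) - 2\bar{d}(S, S \cup \{x\}) + \bar{d}(S, S) \big).
\end{aligned} \tag{66}$$

As a result of (64), we then have

$$\sum_{y \in S \cup \{x\}} \Delta(y, S \cup \{x\}) - \sum_{y \in S \cup \{x\}} \Delta(y, S) \le 0.$$

Similarly, using (65) and (64) yields

$$\begin{aligned}
\sum_{y \in S} \Delta(y, S) - \sum_{y \in S} \Delta(y, S \cup \{x\})
&= |S| \cdot \big( 2\bar{d}(S, S) - \bar{d}(S, S) \big) - |S| \cdot \big( 2\bar{d}(S \cup \{x\}, S) - \bar{d}(S \cup \{x\}, S \cup \{x\}) \big) \\
&= |S| \cdot \big( \bar{d}(S \cup \{x\}, S \cup \{x\}) - 2\bar{d}(S \cup \{x\}, S) + \bar{d}(S, S) \big) \\
&\le 0.
\end{aligned} \tag{67}$$

■
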


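Proof. (Theorem 14) (i) Let $S_k$ (resp. $S'_k$), $k = 1, 2, \ldots, K$, be the partition before (resp. after) the change. Also let $c(x)$ be the index of the set to which $x$ belongs. Suppose that $\Delta(x_i, S_{k^*}) < \Delta(x_i, S_{c(x_i)})$ and $x_i$ is moved from $S_{c(x_i)}$ to $S_{k^*}$ for some point $x_i$ and some $k^*$. In this case, we have $S'_{k^*} = S_{k^*} \cup \{x_i\}$, $S'_{c(x_i)} = S_{c(x_i)} \setminus \{x_i\}$ and $S'_k = S_k$ for all $k \ne c(x_i), k^*$. It then follows from $\Delta(x_i, S_{k^*}) < \Delta(x_i, S_{c(x_i)})$ that

$$\begin{aligned}
\sum_{k=1}^{K} \sum_{x \in S_k} \Delta(x, S_k)
&= \sum_{k \ne c(x_i), k^*} \sum_{x \in S_k} \Delta(x, S_k) + \sum_{x \in S_{c(x_i)}} \Delta(x, S_{c(x_i)}) + \sum_{x \in S_{k^*}} \Delta(x, S_{k^*}) \\
&= \sum_{k \ne c(x_i), k^*} \sum_{x \in S_k} \Delta(x, S_k) + \sum_{x \in S_{c(x_i)} \setminus \{x_i\}} \Delta(x, S_{c(x_i)}) + \Delta(x_i, S_{c(x_i)}) + \sum_{x \in S_{k^*}} \Delta(x, S_{k^*}) \\
&> \sum_{k \ne c(x_i), k^*} \sum_{x \in S_k} \Delta(x, S_k) + \sum_{x \in S_{c(x_i)} \setminus \{x_i\}} \Delta(x, S_{c(x_i)}) + \Delta(x_i, S_{k^*}) + \sum_{x \in S_{k^*}} \Delta(x, S_{k^*}) \\
&= \sum_{k \ne c(x_i), k^*} \sum_{x \in S'_k} \Delta(x, S'_k) + \sum_{x \in S_{c(x_i)} \setminus \{x_i\}} \Delta(x, S_{c(x_i)}) + \sum_{x \in S_{k^*} \cup \{x_i\}} \Delta(x, S_{k^*}),
\end{aligned} \tag{68}$$

where we use the fact that $S'_k = S_k$ for all $k \ne c(x_i), k^*$ in the last equality. From (62) and $S'_{k^*} = S_{k^*} \cup \{x_i\}$, we know that

$$\sum_{x \in S_{k^*} \cup \{x_i\}} \Delta(x, S_{k^*}) \ge \sum_{x \in S_{k^*} \cup \{x_i\}} \Delta(x, S_{k^*} \cup \{x_i\}) = \sum_{x \in S'_{k^*}} \Delta(x, S'_{k^*}). \tag{69}$$

Also, it follows from (63) and $S'_{c(x_i)} = S_{c(x_i)} \setminus \{x_i\}$ that

$$\sum_{x \in S_{c(x_i)} \setminus \{x_i\}} \Delta(x, S_{c(x_i)}) \ge \sum_{x \in S_{c(x_i)} \setminus \{x_i\}} \Delta(x, S_{c(x_i)} \setminus \{x_i\}) = \sum_{x \in S'_{c(x_i)}} \Delta(x, S'_{c(x_i)}). \tag{70}$$

Using (69) and (70) in (68) yields

$$\begin{aligned}
\sum_{k=1}^{K} \sum_{x \in S_k} \Delta(x, S_k)
&> \sum_{k \ne c(x_i), k^*} \sum_{x \in S'_k} \Delta(x, S'_k) + \sum_{x \in S'_{c(x_i)}} \Delta(x, S'_{c(x_i)}) + \sum_{x \in S'_{k^*}} \Delta(x, S'_{k^*}) \\
&= \sum_{k=1}^{K} \sum_{x \in S'_k} \Delta(x, S'_k).
\end{aligned} \tag{71}$$

In view of (71) and Lemma 13(iii), we then conclude that the normalized modularity is increasing when there is a change. Since there is only a finite number of partitions of $\Omega$, the algorithm thus converges to a local optimum of the normalized modularity.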
(ii) The algorithm converges when there are no further changes. As such, we know that for any $x \in S_i$ and $j \ne i$, $\Delta(x, S_i) \le \Delta(x, S_j)$. Summing over all the points $x \in S_i$ and using Lemma 13(i) yields

$$\begin{aligned}
0 &\ge \sum_{x \in S_i} \big( \Delta(x, S_i) - \Delta(x, S_j) \big) \\
&= \sum_{x \in S_i} \big( 2\bar{d}(\{x\}, S_i) - \bar{d}(S_i, S_i) \big) - \sum_{x \in S_i} \big( 2\bar{d}(\{x\}, S_j) - \bar{d}(S_j, S_j) \big) \\
&= |S_i| \cdot \big( \bar{d}(S_i, S_i) - 2\bar{d}(S_i, S_j) + \bar{d}(S_j, S_j) \big).
\end{aligned} \tag{72}$$

When the two sets $S_i$ and $S_j$ are viewed in isolation (by removing the data points not in $S_i \cup S_j$ from $\Omega$), we have $S_j = S_i^c$. Thus,

$$\bar{d}(S_i, S_i) - 2\bar{d}(S_i, S_i^c) + \bar{d}(S_i^c, S_i^c) \le 0.$$

As a result of Theorem 7(viii), we conclude that $S_i$ is a cluster when the two sets $S_i$ and $S_j$ are viewed in isolation. Also, Theorem 7(ii) implies that $S_j = S_i^c$ is also a cluster when the two sets $S_i$ and $S_j$ are viewed in isolation. ■

Appendix F 

In this section, we prove Lemma 16.

We first show (C1). Since $d(x, y) = d(y, x)$ for all $x \ne y$, we have from (21) that $\beta(x, y) = \beta(y, x)$ for all $x \ne y$.

To verify (C2), note from (21) that

$$\sum_{y \in \Omega} \beta(x, y) = \frac{1}{n} \sum_{y \in \Omega} \sum_{z_2 \in \Omega} d(z_2, y) + \sum_{z_1 \in \Omega} d(x, z_1) - \frac{1}{n} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} d(z_2, z_1) - \sum_{y \in \Omega} d(x, y) = 0.$$

Now we show (C3). Note from (21) that

$$\beta(x, x) + \beta(y, z) - \beta(x, z) - \beta(x, y) = -d(x, x) - d(y, z) + d(x, z) + d(x, y).$$

Since $d(x, x) = 0$, it then follows from the triangular inequality for $d(\cdot, \cdot)$ that

$$\beta(x, x) + \beta(y, z) - \beta(x, z) - \beta(x, y) \ge 0.$$



Appendix G 

In this section, we prove Lemma 17.

Clearly, $d(x, x) = 0$ from (22) and thus (D2) holds trivially. That (D3) holds follows from the symmetric property in (C1) of Definition 15.

To see (D1), choosing $z = y$ in (20) yields

$$0 \le \beta(x, x) + \beta(y, y) - \beta(x, y) - \beta(x, y) = 2d(x, y).$$

For the triangular inequality in (D4), note from (22) and (20) in (C3) that

$$\begin{aligned}
d(x, z) + d(z, y) - d(x, y)
&= \frac{\beta(x, x) + \beta(z, z)}{2} - \beta(x, z) + \frac{\beta(z, z) + \beta(y, y)}{2} - \beta(z, y) - \frac{\beta(x, x) + \beta(y, y)}{2} + \beta(x, y) \\
&= \beta(z, z) + \beta(x, y) - \beta(z, x) - \beta(z, y) \ge 0.
\end{aligned}$$
Appendix H 

In this section, we prove Theorem 18.

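We first show that $d^{**}(x, y) = d(x, y)$ for a distance measure $d(\cdot, \cdot)$. Note from (23) and $d(x, x) = 0$ that

$$d^*(x, x) = \frac{1}{n} \sum_{z_2 \in \Omega} d(z_2, x) + \frac{1}{n} \sum_{z_1 \in \Omega} d(x, z_1) - \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} d(z_2, z_1). \tag{75}$$

From the symmetric property of $d(\cdot, \cdot)$, it then follows that

$$d^*(x, x) = \frac{2}{n} \sum_{z_1 \in \Omega} d(x, z_1) - \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} d(z_2, z_1). \tag{76}$$

Similarly,

$$d^*(y, y) = \frac{2}{n} \sum_{z_2 \in \Omega} d(z_2, y) - \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} d(z_2, z_1). \tag{77}$$

Using (23), (76) and (77) in (24) yields

$$d^{**}(x, y) = \big( d^*(x, x) + d^*(y, y) \big)/2 - d^*(x, y) = d(x, y). \tag{78}$$

Now we show that $\beta^{**}(x, y) = \beta(x, y)$ for a cohesion measure $\beta(\cdot, \cdot)$. Note from (24) that

$$\beta^*(z_2, y) + \beta^*(x, z_1) - \beta^*(z_1, z_2) - \beta^*(x, y) = -\beta(z_2, y) - \beta(x, z_1) + \beta(z_1, z_2) + \beta(x, y). \tag{79}$$

Also, we have from (23) that

$$\begin{aligned}
\beta^{**}(x, y)
&= \frac{1}{n} \sum_{z_2 \in \Omega} \beta^*(z_2, y) + \frac{1}{n} \sum_{z_1 \in \Omega} \beta^*(x, z_1) - \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} \beta^*(z_2, z_1) - \beta^*(x, y) \\
&= \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} \big( \beta^*(z_2, y) + \beta^*(x, z_1) - \beta^*(z_1, z_2) - \beta^*(x, y) \big).
\end{aligned} \tag{80}$$

Using (79) in (80) yields

$$\begin{aligned}
\beta^{**}(x, y)
&= \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} \big( \beta(x, y) + \beta(z_1, z_2) - \beta(x, z_1) - \beta(z_2, y) \big) \\
&= \beta(x, y) + \frac{1}{n^2} \sum_{z_2 \in \Omega} \sum_{z_1 \in \Omega} \beta(z_1, z_2) - \frac{1}{n} \sum_{z_1 \in \Omega} \beta(x, z_1) - \frac{1}{n} \sum_{z_2 \in \Omega} \beta(z_2, y).
\end{aligned} \tag{81}$$

Since $\beta(\cdot, \cdot)$ is a cohesion measure that satisfies (C1)-(C3), we have from (C1) and (C2) that the last three terms in (81) are all equal to 0. Thus, $\beta^{**}(x, y) = \beta(x, y)$.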
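
The two round trips in this proof can be reproduced numerically. The sketch below is a hedged illustration: `dual_of_distance` implements the map used in (75) to (77), `dual_of_cohesion` implements the map $(\beta(x, x) + \beta(y, y))/2 - \beta(x, y)$, and both function names and the sample data are hypothetical.

```python
def dual_of_distance(d):
    """Distance matrix -> cohesion matrix: column mean + row mean - grand mean - d."""
    n = len(d)
    row_mean = [sum(d[x][z] for z in range(n)) / n for x in range(n)]
    col_mean = [sum(d[z][y] for z in range(n)) / n for y in range(n)]
    grand_mean = sum(row_mean) / n
    return [[col_mean[y] + row_mean[x] - grand_mean - d[x][y] for y in range(n)]
            for x in range(n)]


def dual_of_cohesion(beta):
    """Cohesion matrix -> distance matrix: (beta(x,x) + beta(y,y)) / 2 - beta(x,y)."""
    n = len(beta)
    return [[(beta[x][x] + beta[y][y]) / 2 - beta[x][y] for y in range(n)]
            for x in range(n)]


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two square matrices."""
    n = len(a)
    return max(abs(a[x][y] - b[x][y]) for x in range(n) for y in range(n))


# Hypothetical data: four points on a line.
pos = [0.0, 1.0, 3.0, 7.0]
d = [[abs(p - q) for q in pos] for p in pos]

beta = dual_of_distance(d)            # cohesion measure induced by d
d_back = dual_of_cohesion(beta)       # should recover d
beta_back = dual_of_distance(d_back)  # should recover beta
print(max_abs_diff(d, d_back) < 1e-9, max_abs_diff(beta, beta_back) < 1e-9)
```

Both differences vanish up to rounding, which mirrors the two statements $d^{**} = d$ and $\beta^{**} = \beta$ for this example.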
Appendix I 


In this section, we prove Proposition 22.

We first show (C1). Since $\beta_1(x, y) = \beta_0(x, y)$ for all $x \ne y$, we have from the symmetric property of $\beta_0(\cdot, \cdot)$ that $\beta_1(x, y) = \beta_1(y, x)$ for all $x \ne y$. In view of (36), we then also have $\beta(x, y) = \beta(y, x)$ for all $x \ne y$.

To verify (C2), note from (36) that

$$\sum_{y \in \Omega} \beta(x, y) = \sum_{y \in \Omega} \beta_1(x, y) - \frac{1}{n} \sum_{y \in \Omega} \sum_{z_1 \in \Omega} \beta_1(z_1, y) - \sum_{z_2 \in \Omega} \beta_1(x, z_2) + \frac{1}{n} \sum_{z_1 \in \Omega} \sum_{z_2 \in \Omega} \beta_1(z_1, z_2) = 0. \tag{82}$$

Now we show (C3). Note from (36) that

$$\beta(x, x) + \beta(y, z) - \beta(x, z) - \beta(x, y) = \beta_1(x, x) + \beta_1(y, z) - \beta_1(x, z) - \beta_1(x, y). \tag{83}$$

It then follows from (35) that for all $x \ne y \ne z$,

$$\beta(x, x) + \beta(y, z) - \beta(x, z) - \beta(x, y) \ge 0. \tag{84}$$

If either $x = y$ or $x = z$, we also have

$$\beta(x, x) + \beta(y, z) - \beta(x, z) - \beta(x, y) = 0.$$

Thus, it remains to show the case that $y = z$ and $x \ne y$. For this case, we need to show that

$$\beta(x, x) + \beta(y, y) - \beta(x, y) - \beta(x, y) \ge 0. \tag{85}$$

Note from (84) that

$$\beta(y, y) + \beta(x, z) - \beta(y, z) - \beta(y, x) \ge 0. \tag{86}$$

Summing the two inequalities in (84) and (86) and using the symmetric property of $\beta(\cdot, \cdot)$ yields the desired inequality in 


